ABSTRACT

In this paper, we attempt to revisit the problem of multi-party con- 
ferencing from a practical perspective, and to rethink the design 
space involved in this problem. We believe that an emphasis on 
low end-to-end delays between any two parties in the conference 
is a must, and the source sending rate in a session should adapt 
to bandwidth availability and congestion. We present Celerity, a 
multi-party conferencing solution specifically designed to achieve 
our objectives. It is entirely Peer-to-Peer (P2P), and as such
eliminates the cost of maintaining centrally administered servers. It
is designed to deliver video with low end-to-end delays, at quality 
levels commensurate with available network resources over arbi- 
trary network topologies where bottlenecks can be anywhere in the 
network. This is in contrast to commonly assumed P2P scenarios 
where bandwidth bottlenecks reside only at the edge of the net- 
work. The highlight in our design is a distributed and adaptive rate 
control protocol, that can discover and adapt to arbitrary topolo- 
gies and network conditions quickly, converging to efficient link 
rate allocations allowed by the underlying network. In accordance 
with adaptive link rate control, source video encoding rates are also 
dynamically controlled to optimize video quality in arbitrary and 
unpredictable network conditions. We have implemented Celerity 
in a prototype system, and demonstrate its superior performance 
over existing solutions in a local experimental testbed and over the 
Internet. 



1. INTRODUCTION 

With the availability of front-facing cameras in high-end smart- 
phone devices (such as the Samsung Galaxy S and the iPhone 4), 
notebook computers, and HDTVs, multi-party video conferencing, 
which involves more than two participants in a live conferencing 
session, has attracted a significant amount of interest from the
industry. Skype, for example, has recently launched a monthly-paid
service supporting multi-party video conferencing in its latest
version (Skype 5) [1]. Skype video conferencing has also been
recently supported in a range of new Skype-enabled televisions, such
as the Panasonic VIERA series, so that full-screen high-definition
video conferencing can be enjoyed in one's living room. Moreover,
Google has supported multi-party video conferencing in its latest
social network service Google+, and Facebook, in cooperation with
Skype, plans to provide video conferencing service to its billions
of users. We argue that these new conferencing solutions have the
potential to provide an immersive human-to-human communica- 
tion experience among remote participants. Such an argument has 



been corroborated by many industry leaders: Cisco predicts that 
video conferencing and tele-presence traffic will increase ten-fold 
between 2008 and 2013 [2].

While traffic flows in a live multi-party conferencing session are 
fundamentally represented by a multi-way communication process, 
today's multi-party video conferencing systems are engineered
in practice by composing communication primitives (e.g.,
transport protocols) over uni-directional feed-forward links, with 
primitive feedback mechanisms such as various forms of acknowl- 
edgments in TCP variants or custom UDP-based protocols. We 
believe that a high-quality protocol design must harness the full po- 
tential of the multi-way communication paradigm, and must guar- 
antee the stringent requirements of low end-to-end delays, with the 
highest possible source coding rates that can be supported by dy- 
namic network conditions over the Internet. 

From the industry perspective, known designs of commercially 
available multi-party conferencing solutions are either largely server- 
based, e.g., Microsoft Office Communicator, or are separated into 
multiple point-to-point sessions (this approach is called Simulcast), 
e.g., Apple iChat. Server-based solutions are susceptible to central 
resource bottlenecks, and as such scalability becomes a main con- 
cern when multiple conferences are to be supported concurrently. 
In the Simulcast approach, each user splits its uplink bandwidth 
equally among all receivers and streams to each receiver separately. 
Though simple to implement, Simulcast suffers from poor quality 
of service. Specifically, peers with low upload capacity are forced 
to use a low video rate that degrades the overall experience of the 
other peers. 

In the academic literature, there have recently been several studies
on peer-to-peer (P2P) video conferencing from a utility maximization
perspective [3-8]. Among them, the works of Li et al. [3] and Chen
et al. [4] are the most closely related to ours (we call their
unified approach Mutualcast). They support content distribution and
multi-party video conferencing in multicast sessions by maximizing
aggregate application-specific utility and the utilization of node
uplink bandwidth in P2P networks. Specific depth-1 and depth-2 tree
topologies are constructed using tree packing, and rate control is
performed in each of the tree-based one-to-many sessions. However,
they consider only the limited scenario where bandwidth bottlenecks
reside at the edge of the network, while in practice bandwidth
bottlenecks can easily reside in the core of the network [9, 10].
Further, all existing industrial and academic solutions, including
Mutualcast, do not explicitly consider bounded delay in their
designs, which can lead to an unsatisfactory interactive
conferencing experience.



1.1 Contribution 

In this paper, we reconsider the design space in multi-party video 
conferencing solutions, and present Celerity, a new multi-party con- 
ferencing solution specifically designed to maintain low end-to-end 
delays while maximizing source coding rates in a session. Celerity 
has the following salient features: 

• It operates in a pure P2P manner, and as such eliminates the
cost of maintaining centrally administered servers.

• It can deliver video at quality levels commensurate with avail- 
able network resources over arbitrary network topologies, 
while maintaining bounded end-to-end delays. 

• It can automatically adapt to unpredictable network dynamics,
such as cross traffic and abrupt link failures, allowing a
smooth conferencing experience.

Enabling the above features for multi-party conferencing is
challenging. First, it requires a non-trivial formulation that allows
systematic solution design over arbitrary network capacity constraints.
In contrast, existing P2P system designs with performance
guarantees commonly assume that bandwidth bottlenecks reside at the
edge of the network. Second, maximizing session rates subject to
bounded delay is known to be NP-complete and hard to solve
approximately [11]. We take a practical approach in this paper that
explores all 2-hop delay-bounded overlay trees with polynomial
complexity. Third, detecting and reacting to network dynamics without
a priori knowledge of the network conditions is non-trivial. We
use both delay and loss as congestion measures and adapt the session
rates with respect to both of them, allowing early detection of and
fast response to unpredictable network dynamics.

The highlight in our design is a distributed rate control proto- 
col, that can discover and adapt to arbitrary topologies and network 
conditions quickly, converging to efficient link rate allocations al- 
lowed by the underlying network. In accordance with adaptive link 
rate control, source video encoding rates are also dynamically con- 
trolled to optimize video quality in arbitrary and unpredictable un- 
derlay network conditions. 

The design of Celerity is largely inspired by our new formula- 
tion that specifically takes into account arbitrary network capacity 
constraints and allows us to explore design space beyond those in 
existing solutions. Our formulation is overlay link based and has 
a number of variables linear in the number of overlay links. This 
is a significant reduction as compared to the number of variables 
exponential in the number of overlay links in an alternative tree- 
based formulation. We believe our approach is applicable to other 
P2P system problems, to allow solution design beyond the common 
assumption in P2P scenarios that the bandwidth bottlenecks reside 
only at the edge of the network. 

We have implemented a prototype Celerity system using C++. 
By extensive experiments in a local experimental testbed and on 
the Internet, we demonstrate the superior performance of Celerity 
over state-of-the-art solutions Simulcast and Mutualcast. 

1.2 Paper Organization 

The rest of this paper is organized as follows. In Section 2, we
introduce a general formulation for the multi-party conferencing
problem; existing solutions can be considered as algorithms solving
its special cases. We present and discuss the designs of two critical
components of Celerity, the tree packing module and the link
rate control module, in Sections 3 and 4, respectively. We present
the practical implementation of Celerity in Section 5 and the
experimental results in Section 6. Finally, we conclude in Section 7.
We leave all the proofs and pseudocode to the Appendix.



Notation                              Definition
------------------------------------  ------------------------------------------------------------
$\mathcal{L}$                         Set of all physical links
$V$                                   Set of conference participating nodes
$E$                                   Set of directed overlay links
$C_l$                                 Capacity of the physical link $l$
$a_{l,e}$                             Whether overlay link $e$ passes physical link $l$
$c_{m,e}$                             Rate allocated to session $m$ on overlay link $e$
$\mathbf{c}_m$                        Overlay link rates of stream $m$, $\mathbf{c}_m = [c_{m,e}, e \in E]$
$\mathbf{c}$                          Overlay link rates of all streams, $\mathbf{c} = [\mathbf{c}_1^T, \ldots, \mathbf{c}_M^T]^T$
$\mathbf{y}$                          Total overlay link traffic, $\mathbf{y} = \sum_{m=1}^{M} \mathbf{c}_m$
$D$                                   Delay bound
$R_m(\mathbf{c}_m, D)$                Session $m$'s rate within the delay bound $D$
$q_l(z)$                              Price function of violating link $l$'s capacity constraint
$p_l$                                 Lagrange multiplier of link $l$'s capacity constraint
$\mathcal{G}(\mathbf{c}, \mathbf{p})$ Lagrange function of variables $\mathbf{c}$ and $\mathbf{p}$

Note: we use bold symbols to denote vectors.
Table 1: Key notations.

2. PROBLEM FORMULATION AND CELERITY OVERVIEW

One way to design a multi-party conferencing system is to formulate
its fundamental design problem, explore powerful theoretical
techniques to solve the problem, and use the obtained insights
to guide practical system designs. In this way, we can also be clear
about the potential and limitations of the designs, allowing easy
system tuning and further systematic improvements. Table 1 lists the
key notations used in this paper.

2.1 Settings 

Consider a network modeled as a directed graph $G = (N, \mathcal{L})$,
where $N$ is the set of all physical nodes, including conference
participating nodes and other intermediate nodes such as routers, and
$\mathcal{L}$ is the set of all physical links. Each link
$l \in \mathcal{L}$ has a nonnegative capacity $C_l$ and a nonnegative
propagation delay $d_l$.

Consider a multi-party conferencing system over $G$. We use
$V \subseteq N$ to denote the set of all conference participating
nodes. Every node in $V$ is a source and at the same time a receiver
for every other node. Thus there are in total $M = |V|$ sessions of
(audio/video) streams. Each stream is generated at a source node, say
$v$, and needs to be delivered to all the remaining nodes in
$V \setminus \{v\}$, by using overlay links between any two nodes in $V$.

An overlay link $(u, v)$ means $u$ can send data to $v$ by setting up
a TCP/UDP connection, along an underlay path from $u$ to $v$
pre-assigned by routing protocols. Let $E$ be the set of all directed
overlay links. For all $e \in E$ and $l \in \mathcal{L}$, we define

$$ a_{l,e} = \begin{cases} 1, & \text{if overlay link } e \text{ passes physical link } l; \\ 0, & \text{otherwise.} \end{cases} \qquad (1) $$

The physical link capacity constraints are then expressed as

$$ \mathbf{a}_l^T \mathbf{y} = \sum_{e \in E} a_{l,e} \sum_{m=1}^{M} c_{m,e} \le C_l, \quad \forall l \in \mathcal{L}, $$

where $c_{m,e}$ denotes the rate allocated to session $m$ on overlay
link $e$ and $\mathbf{a}_l^T \mathbf{y}$ describes the total overlay
traffic passing through physical link $l$.

Remark: In our model, the capacity bottleneck can be anywhere 
in the network, not necessarily at the edges. This is in contrast to 
a common assumption made in previous P2P works that the up- 
links/downlinks of participating nodes are the only capacity bottle- 
neck. 
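As an illustration of the capacity constraints above, the following minimal sketch computes the load $\mathbf{a}_l^T \mathbf{y}$ on each physical link from a routing matrix and per-session overlay rates. The toy instance (two physical links, three overlay links, two sessions) and the function names are hypothetical and chosen only for this example.

```python
# Hypothetical toy instance: 2 physical links, 3 overlay links, 2 sessions.
# routing[l][e] = 1 if overlay link e passes physical link l (a_{l,e} in (1)).
routing = [
    [1, 1, 0],
    [0, 1, 1],
]
capacity = [10.0, 6.0]   # C_l for each physical link
rates = [                # rates[m][e] = c_{m,e}
    [2.0, 1.0, 3.0],
    [1.0, 2.0, 0.5],
]

def link_loads(routing, rates):
    """Return a_l^T y for every physical link l, where y = sum_m c_m."""
    num_overlay = len(routing[0])
    y = [sum(session[e] for session in rates) for e in range(num_overlay)]
    return [sum(a * y_e for a, y_e in zip(row, y)) for row in routing]

def violated_links(routing, rates, capacity):
    """Indices of physical links whose total overlay traffic exceeds C_l."""
    loads = link_loads(routing, rates)
    return [l for l, (load, cap) in enumerate(zip(loads, capacity)) if load > cap]

loads = link_loads(routing, rates)
overloaded = violated_links(routing, rates, capacity)
```

In this instance the second physical link carries 6.5 units against a capacity of 6, so it is the only violated constraint, even though it need not be an access link of any participant.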



2.2 Problem Formulation 

In a multi-party conferencing system, each session source broadcasts
its stream to all receivers over a complete overlay graph on
which every link $e$ has a rate $c_{m,e}$ and a delay
$\sum_{l \in \mathcal{L}} a_{l,e} d_l$. For a smooth conferencing
experience, the total delay of delivering a packet from the source to
any receiver, traversing one or multiple overlay links, cannot exceed
a delay bound $D$.

A fundamental design problem is to maximize the overall con- 
ferencing experience, by properly allocating the overlay link rates 
to the streams subject to physical link capacity constraints. We for- 
mulate the problem as a network utility maximization problem: 



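$$ \textbf{MP:} \quad \max_{\mathbf{c} \ge 0} \; \sum_{m=1}^{M} U_m\big(R_m(\mathbf{c}_m, D)\big) \qquad (2) $$

$$ \text{s.t.} \quad \mathbf{a}_l^T \mathbf{y} \le C_l, \quad \forall l \in \mathcal{L}. \qquad (3) $$

The optimization variables are $\mathbf{c}$ and the constraints in (3)
are the physical link capacity constraints.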

$R_m(\mathbf{c}_m, D)$ denotes session $m$'s rate that we obtain by
using resource $\mathbf{c}_m$ within the delay bound $D$, and is a
concave function of $\mathbf{c}_m$, as we will show in Corollary 1 in
the next section.

The objective is to maximize the aggregate system utility. $U_m(R_m)$
is an increasing and strictly concave function that maps the stream
rate to an application-specific utility. For example, a commonly
used video quality measure, Peak Signal-to-Noise Ratio (PSNR),
can be modeled by using a logarithmic function as the utility [4]
(see footnote 1). Since an increasing concave function of a concave
function is concave, $U_m(R_m)$ is concave in $\mathbf{c}$ and
the problem MP is a concave optimization problem.
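To make the role of the concave utility concrete, the sketch below evaluates the aggregate logarithmic utility for two hypothetical allocations with the same total rate. The rate values and the unweighted form $U_m(R_m) = \log R_m$ are assumptions made only for illustration.

```python
import math

def aggregate_utility(session_rates):
    """Sum of logarithmic utilities, U_m(R_m) = log(R_m), over all sessions."""
    return sum(math.log(rate) for rate in session_rates)

# Two hypothetical allocations with the same total rate of 6 units.
balanced = aggregate_utility([2.0, 2.0, 2.0])
skewed = aggregate_utility([4.0, 1.0, 1.0])
```

The balanced allocation yields the larger aggregate utility, which is the behavior one expects from a strictly concave utility: starving some sessions to favor another is penalized.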

Remarks: (i) The formulation of MP is an overlay link based
formulation in which the number of variables per session is $|E|$ and
thus at most $|V|^2$. One can write an equivalent tree-based
formulation for MP, but the number of variables per session will be
exponential in $|E|$ and $|V|$. (ii) Existing solutions, such as
Simulcast and Mutualcast, can be thought of as algorithms solving
special cases of the problem MP. For example, Simulcast can be thought
of as solving the problem MP by using only the 1-hop tree to broadcast
content within a session. Mutualcast can be thought of as solving a
special case of the problem MP (with the uplinks of participating
nodes being the only capacity bottleneck) by packing certain depth-1
and depth-2 trees within a session.
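The gap between the two formulations in remark (i) can be illustrated numerically. The sketch below compares the number of overlay links in a complete overlay graph with the number of spanning trees of a complete graph given by Cayley's formula. Using the spanning-tree count as a stand-in for the number of tree variables is an assumption made for illustration; the tree-based formulation itself is not spelled out in the text.

```python
def link_based_variables(num_nodes):
    """Variables per session in the overlay-link formulation: |E| = |V|(|V|-1)."""
    return num_nodes * (num_nodes - 1)

def spanning_tree_count(num_nodes):
    """Spanning trees of a complete graph on |V| labeled nodes (Cayley's formula)."""
    return num_nodes ** (num_nodes - 2)

for n in (4, 8, 16):
    print(n, link_based_variables(n), spanning_tree_count(n))
```

Already for 8 participants the link-based count is 56 while the spanning-tree count is 262144.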

2.3 Celerity Overview 

Celerity builds upon two main modules to maximize the system
utility: (1) a delay-bounded video delivery module to distribute
video at a high rate given overlay link rates (i.e., how to compute and
achieve $R_m(\mathbf{c}_m, D)$); (2) a link rate control module to
determine $\mathbf{c}_m$.

Video delivery under known link constraints: This problem is 
similar to the classic multicast problem, and packing spanning (or 
Steiner) trees at the multicast source is a popular solution. How- 
ever, the unique "delay-bounded" requirement in multi-party con- 
ferencing makes the problem more challenging. We introduce a 
delay-bounded tree packing algorithm to tackle this problem
(detailed in Section 3).

Link rate control: In principle, one can first infer the network 
constraints and then solve the problem MP centrally. However, 
directly inferring the constraints potentially requires knowing the 
entire network topology and is highly challenging. In Celerity, we
resort to designing adaptive and iterative algorithms for solving the
problem MP in a distributed manner, without a priori knowledge
of the network conditions (detailed in Section 4).

Footnote 1: Using logarithmic functions also guarantees (weighted)
proportional fairness among sessions, and thus no session will starve
at the optimal solution [12].

Figure 1: An illustrative example of 4-party (A, B, C, and D)
conferencing over a dumbbell underlay topology. E and F are two
routers. Solid lines represent underlay physical links. To make
the graph easy to read, we use one solid line to represent a pair of
directed physical links. Dashed lines represent overlay links.



We explain at a high level how Celerity works using the 4-party
conferencing example in Fig. 1. We focus on session A, in which source
A distributes its stream to receivers B, C, and D, by packing
delay-bounded trees over the complete overlay graph shown in the figure.
We focus on source A and one overlay link (B, C), which represents
a UDP connection over the underlay path B to E to F to C. Other
overlay links and other sessions are handled similarly.

We first describe the control plane operations. For the overlay
link (B, C), the head node B works with the tail node C to
periodically adjust the session rate $c_{A,B \to C}$ according to
Celerity's link rate control algorithm. Such adjustment utilizes
control-plane information that source A piggybacks with data packets,
and loss and delay statistics experienced by packets traveling from B
to C. We show that such local adjustments at every overlay link result
in globally optimal session rates.

The head node B also periodically reports to source A the session
rate $c_{A,B \to C}$ and the end-to-end delay from B to C. Based on
these reports from all overlay links, source A periodically packs
delay-bounded trees using Celerity's tree-packing algorithm,
calculates the necessary control-plane information, and delivers data
and the control-plane information along the trees.

The data plane operations are simple. Celerity uses delay-bounded
trees to distribute data in a session. Nodes on every tree forward
packets from their upstream parent to their downstream children,
following the "next-children" tree-routing information embedded in
the packet header. Celerity's tree-packing algorithm guarantees that
(i) packets arrive at all receivers within the delay bound, and (ii) the
total rate of a session $m$ passing through an overlay link $e$ does not
exceed the allocated rate $c_{m,e}$.
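The forwarding rule described above can be sketched as follows. The packet layout (a dictionary carrying the tree as a parent-to-children map together with the payload) and the function name `deliver` are hypothetical; the actual header format used by Celerity is not specified in this section.

```python
def deliver(tree, source, payload):
    """Forward a packet along one tree and return the nodes that received it.

    `tree` maps each node to its list of children; it plays the role of the
    "next-children" routing information carried in the packet header.
    """
    received = []
    pending = [(source, {"tree": tree, "payload": payload})]
    while pending:
        node, packet = pending.pop()
        # Each node looks up its own children in the header and forwards.
        for child in packet["tree"].get(node, []):
            received.append(child)
            pending.append((child, packet))
    return received

# Session A over a hypothetical 2-hop tree A -> B -> {C, D}.
tree = {"A": ["B"], "B": ["C", "D"]}
receivers = deliver(tree, "A", b"frame-0")
```

Because the routing information travels with the packet, a relay such as B needs no per-session forwarding state in this sketch.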

In the following two sections, we first present the designs of the 
two main modules in Celerity. We then describe how they are
implemented in physical peers in Section 5.

3. PACKING DELAY-BOUNDED TREES 

Given the link rate vector $\mathbf{c}_m$ and the delay of every
overlay link $e$ (i.e., $\sum_{l \in \mathcal{L}} a_{l,e} d_l$),
achieving the maximum broadcast/multicast stream rate under a delay
bound $D$ is a challenging problem. A general way to explore the
broadcast/multicast rate under delay bounds is to pack delay-bounded
Steiner trees. However, this problem is NP-hard [13]. Moreover, the
number of delay-bounded Steiner trees to consider is in general
exponential in the network size.

In this paper, we pack 2-hop delay-bounded trees in an overlay
graph of session $m$, denoted by $\mathcal{D}_m$, to achieve a good
stream rate under a delay bound. In graph-theoretic terms, a 2-hop
tree has depth at most 2. Packing 2-hop trees is easy to implement.
It also explores all overlay links between source and receiver and
between receivers, thus trying to utilize resources efficiently. In
fact, it is shown in [3, 4] that packing 2-hop multicast trees
suffices to achieve the maximum multicast rate for certain P2P
topologies. We elaborate our tree-packing scheme in the following.

Figure 2: Illustration of the directed acyclic sub-graph over which
we pack delay-bounded 2-hop trees.

We first define the overlay graph $\mathcal{D}_m$. Graph
$\mathcal{D}_m$ is a directed acyclic graph with two layers; one
example of such a graph is illustrated in Fig. 2. In this example,
consider a session with a source $s$ and three receivers 1, 2, 3. For
each receiver $i$, we draw two nodes, $r_i$ and $t_i$, in the graph
$\mathcal{D}_m$; $t_i$ models the receiving functionality of node $i$
and $r_i$ models the relaying functionality of node $i$.

Suppose that the prescribed link bit rates are given by the vector
$\mathbf{c}_m$, with the capacity for link $e$ being $c_{m,e}$. Then
in $\mathcal{D}_m$, the link from $s$ to $r_i$ has capacity
$c_{m,s \to r_i}$, the link from $r_i$ to $t_j$ (with $i \neq j$)
has capacity $c_{m,r_i \to t_j}$, and the link from $r_i$ to $t_i$
has infinite capacity. If the propagation delay of an edge $e$ exceeds
the delay bound, we do not include it in the graph. If the propagation
delay of a two-hop path $s \to r_i \to t_j$ exceeds the delay bound,
we omit the edge from $r_i$ to $t_j$ from the graph. As a result,
every path from $s$ to any receiver $t_j$ in the graph has a path
propagation delay within the delay bound.
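The construction of the two-layer graph can be sketched as below. The dictionary-based representation, the node labels `("r", i)` and `("t", i)`, and the example delays are assumptions made for illustration only.

```python
INF = float("inf")

def build_two_layer_graph(source, receivers, rate, delay, delay_bound):
    """Return {(u, v): capacity} for the two-layer graph of one session.

    rate[(u, v)] and delay[(u, v)] describe overlay link u -> v.
    Node ("r", i) is the relay copy of receiver i and ("t", i) its receiving copy.
    """
    graph = {}
    for i in receivers:
        first_hop = delay[(source, i)]
        if first_hop > delay_bound:
            continue                       # edge alone exceeds the bound
        graph[(source, ("r", i))] = rate[(source, i)]
        graph[(("r", i), ("t", i))] = INF  # relay and receiver are the same peer
        for j in receivers:
            # Keep r_i -> t_j only if the two-hop path meets the delay bound.
            if j != i and first_hop + delay[(i, j)] <= delay_bound:
                graph[(("r", i), ("t", j))] = rate[(i, j)]
    return graph

# Hypothetical rates and propagation delays (milliseconds).
rate = {("s", 1): 1.0, ("s", 2): 1.0, (1, 2): 1.0, (2, 1): 1.0}
delay = {("s", 1): 50, ("s", 2): 120, (1, 2): 60, (2, 1): 60}
graph = build_two_layer_graph("s", [1, 2], rate, delay, delay_bound=150)
```

With a delay bound of 150 ms, the two-hop path through receiver 2 to receiver 1 takes 180 ms and is pruned, while the path through receiver 1 to receiver 2 (110 ms) is kept.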

Over such a 2-layer sub-graph $\mathcal{D}_m$, we pack 2-hop trees
connecting the source and every receiver using the greedy algorithm
proposed in [14]. Below we briefly describe the algorithm; more
details can be found in [14].

Assume all edges have unit capacity and allow multiple
edges for each ordered node pair. The algorithm packs unit-capacity
trees one by one. Each unit-capacity tree is constructed greedily,
edge by edge, starting from the source and augmenting towards all
receivers. It is similar to the greedy tree-packing algorithm based on
Prim's algorithm. The distinction lies in the rule for selecting the
edge among all potential edges: the greedy algorithm chooses the edge
whose removal leads to the least reduction in the multicast capacity
of the residual graph.

We show a simple example to illustrate how the tree packing
algorithm works. Fig. 3 shows the process of packing a unit-capacity
tree over a 2-layer sub-graph. In this example, $s$ is the source and
$t_1$, $t_2$, $t_3$ are three receivers; each edge from $s$ to $r_i$
($i = 1, 2, 3$) and from $r_i$ to $t_j$ ($i \neq j$) has unit capacity.
The $\infty$ associated with the edge between $r_i$ and $t_i$ means
the edge has infinite capacity.

The tree packing algorithm maintains a "connected set", denoted 
by T, that contains all the nodes that can be reached from s during 
the tree construction process. Initially, T = {s} contains only the 
source s. In each step, the algorithm adds and connects one more 
node to the tree and appends the node into T. The algorithm finds 
a tree when T contains all the receivers. 

As shown in Fig. 3, in Step 1, the algorithm evaluates the links
starting from the source and greedily picks the edge whose removal
gives the smallest reduction of the multicast capacity in the residual
graph. In this example, any edge leaving $s$ can be chosen because
their removals give the same reduction. Our algorithm randomly
picks one such equally good edge, in this case say edge $s \to r_1$.
The algorithm adds node $r_1$ into $T$ and updates it to
$T = \{s, r_1\}$.

In Step 2, the algorithm evaluates the edges originating from any
node in $T$. In this case it picks edge $r_1 \to t_1$ and updates $T$
to $\{s, r_1, t_1\}$. The algorithm repeats the process until all the
receivers are in $T$, which is Step 4 in this example. The algorithm
then successfully constructs a unit-capacity tree
$s \to r_1 \to \{t_1, t_2, t_3\}$. Afterwards, the algorithm resets
$T = \{s\}$ and constructs the next tree in the residual graph, until
no further unit-capacity tree can be constructed.
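The walk-through above can be mimicked by the following simplified sketch. It is not the algorithm of [14]: it only implements the selection rule as described in the text (pick the candidate edge whose removal reduces the multicast capacity of the residual graph the least), breaks ties deterministically instead of randomly, and evaluates the multicast capacity of the two-layer graph as the minimum over receivers of the summed per-relay bottlenecks. The node names, the constant `BIG` standing in for infinite capacity, and all function names are assumptions for this example.

```python
BIG = 10**9  # stands in for the infinite capacity of r_i -> t_i

def multicast_capacity(cap, n):
    """min over receivers j of sum_i min(cap[s->r_i], cap[r_i->t_j])."""
    return min(
        sum(min(cap.get(("s", f"r{i}"), 0), cap.get((f"r{i}", f"t{j}"), 0))
            for i in range(1, n + 1))
        for j in range(1, n + 1)
    )

def pack_one_tree(cap, n):
    """Grow one unit-capacity tree from s; return its edges, or None if stuck."""
    connected, tree = {"s"}, []
    targets = {f"t{j}" for j in range(1, n + 1)}

    def capacity_without(edge):
        cap[edge] -= 1
        value = multicast_capacity(cap, n)
        cap[edge] += 1
        return value

    while not targets <= connected:
        candidates = [e for e in sorted(cap)
                      if cap[e] > 0 and e[0] in connected and e[1] not in connected]
        if not candidates:
            for e in tree:
                cap[e] += 1          # undo the partial tree
            return None
        # Least reduction in residual capacity; ties go to the first candidate.
        best = max(candidates, key=capacity_without)
        cap[best] -= 1
        connected.add(best[1])
        tree.append(best)
    return tree

def pack_trees(cap, n):
    """Pack unit-capacity trees until no further tree can be built."""
    trees = []
    while (tree := pack_one_tree(cap, n)) is not None:
        trees.append(tree)
    return trees

# The example of Fig. 3: three receivers, unit capacities.
n = 3
cap = {("s", f"r{i}"): 1 for i in range(1, n + 1)}
for i in range(1, n + 1):
    for j in range(1, n + 1):
        cap[(f"r{i}", f"t{j}")] = BIG if i == j else 1
trees = pack_trees(cap, n)
```

On this instance the sketch packs three trees, one through each relay, and the first tree matches the one constructed in the walk-through. On other graphs this simplified version may stop early, since it does not reproduce the full procedure of [14].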

The above greedy algorithm is very simple to implement, and its
practical implementation details are further discussed in Section 5.

Utilizing the special structure of the graph $\mathcal{D}_m$, we
obtain a performance guarantee for the algorithm as follows.

Theorem 1. The tree-packing algorithm in [14] achieves the
minimum of the min-cuts separating the source and receivers in
$\mathcal{D}_m$, which is expressed as

$$ R_m(\mathbf{c}_m, D) = \min_{j} \sum_{i} \min\{c_{m,s \to r_i},\; c_{m,r_i \to t_j}\}. \qquad (4) $$

Furthermore, the algorithm has a running time of $O(|V||E|^2)$.



Proof: Refer to Appendix A. 
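The min-cut value in the theorem can be evaluated directly, because every path from $s$ to a receiver $t_j$ in the two-layer graph passes through exactly one relay node, so paths through different relays share no edge. The sketch below assumes a list-based representation of the capacities and hypothetical numerical values; pruned edges are represented by zero capacity.

```python
INF = float("inf")

def delay_bounded_rate(src_rate, relay_rate):
    """Evaluate min_j sum_i min(c[s->r_i], c[r_i->t_j]) on the two-layer graph.

    src_rate[i]       -- capacity of s -> r_i
    relay_rate[i][j]  -- capacity of r_i -> t_j (0 if the edge was pruned)
    The edge r_i -> t_i is treated as having infinite capacity.
    """
    n = len(src_rate)

    def cut_to(j):
        return sum(
            min(src_rate[i], INF if i == j else relay_rate[i][j])
            for i in range(n)
        )

    return min(cut_to(j) for j in range(n))

# Hypothetical capacities for a session with three receivers.
src_rate = [3.0, 1.0, 2.0]
relay_rate = [
    [0.0, 2.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.5, 2.0, 0.0],
]
rate = delay_bounded_rate(src_rate, relay_rate)
```

Here the cuts toward the three receivers are 4.5, 5.0, and 4.0, so the session rate is limited to 4.0 by the third receiver.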

Hence, our tree-packing algorithm achieves the maximum
delay-bounded multicast rate over the 2-layer sub-graph
$\mathcal{D}_m$. The achieved rate $R_m(\mathbf{c}_m, D)$ is a concave
function of $\mathbf{c}_m$, as summarized below.

Corollary 1. The delay-bounded multicast rate
$R_m(\mathbf{c}_m, D)$ obtained by our tree-packing algorithm is a
concave function of the overlay link rates $\mathbf{c}_m$.

Proof. Refer to Appendix B. 
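The concavity claim can be spot-checked numerically. The sketch below is not a proof (the proof is in Appendix B); it samples random pairs of rate vectors and verifies midpoint concavity of the two-layer min-cut expression, that is, a minimum over receivers of sums of per-relay bottlenecks. The sampling ranges and function names are assumptions.

```python
import random

def rate(c_src, c_relay):
    """min_j sum_i min(c[s->r_i], c[r_i->t_j]), with r_i -> t_i uncapacitated."""
    n = len(c_src)
    return min(
        sum(min(c_src[i], float("inf") if i == j else c_relay[i][j])
            for i in range(n))
        for j in range(n)
    )

def midpoint_gap(n, rng):
    """R((x+y)/2) - (R(x)+R(y))/2 for two random rate vectors x and y."""
    def sample():
        return ([rng.uniform(0, 5) for _ in range(n)],
                [[rng.uniform(0, 5) for _ in range(n)] for _ in range(n)])
    (xs, xr), (ys, yr) = sample(), sample()
    mid_s = [(a + b) / 2 for a, b in zip(xs, ys)]
    mid_r = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(xr, yr)]
    return rate(mid_s, mid_r) - (rate(xs, xr) + rate(ys, yr)) / 2

rng = random.Random(0)
smallest_gap = min(midpoint_gap(4, rng) for _ in range(1000))
```

For a concave function the gap is nonnegative for every pair, so `smallest_gap` should not fall below zero beyond floating-point rounding.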

3.1 Packing Delay-Bounded Trees with Helpers

In the previous discussion, we did not involve helpers (a helper
node is neither a source nor a receiver in the conferencing session,
but it is willing to help in distributing content) in our tree packing
algorithm. In fact, the tree packing algorithm can also achieve
the minimum of the min-cuts separating the source and receivers in
$\mathcal{D}_m$ when helpers are present.

To see how the tree packing algorithm can be applied to
$\mathcal{D}_m$ with helpers, we first define the 2-layer sub-graph
$\mathcal{D}_m$ in the presence of helpers; one example of such a
graph is illustrated in Fig. 4. In this example, consider a session
with a source $s$, three receivers 1, 2, 3, and a helper $h_1$. As
before, for each receiver $i$, we draw two nodes, $r_i$ and $t_i$, in
the graph $\mathcal{D}_m$; $t_i$ models the receiving functionality of
node $i$ and $r_i$ models the relaying functionality of node $i$.

Suppose that the prescribed link bit rates are given by the vector
$\mathbf{c}_m$, with the capacity for link $e$ being $c_{m,e}$. Then
in $\mathcal{D}_m$, the link from $s$ to $r_i$ has capacity
$c_{m,s \to r_i}$, the link from $r_i$ to $t_j$ (with $i \neq j$) has
capacity $c_{m,r_i \to t_j}$, and the link from $r_i$ to $t_i$ has
infinite capacity. Similarly, the link from $s$ to $h_k$ (a helper)
has capacity $c_{m,s \to h_k}$ and the link from $h_k$ to $t_j$ has
capacity $c_{m,h_k \to t_j}$. If the propagation delay of an edge $e$
exceeds the delay bound, we do not include it in the graph. If the
propagation delay of a two-hop path $s \to v \to t_j$, with
$v \in \{r_i\} \cup \{h_k\}$, exceeds the delay bound, we omit the
edge from $v$ to $t_j$ from the graph. As a result, every path from
$s$ to any receiver $t_j$ in the graph has a path propagation delay
within the delay bound.

Over such a 2-layer sub-graph $\mathcal{D}_m$, we use the same greedy
tree packing algorithm to pack 2-hop trees connecting the source and
every receiver, and it can still achieve the minimum of the min-cuts
separating the source and receivers in $\mathcal{D}_m$, as described
below.

Figure 3: Example of packing a unit-capacity tree, starting from $s$
and reaching all receivers $t_1$, $t_2$, and $t_3$, using our greedy
tree packing algorithm.

Figure 4: Illustration of the 2-layer sub-graph $\mathcal{D}_m$ with
a helper.

Theorem 2. The tree-packing algorithm in [14] achieves the
minimum of the min-cuts separating the source and receivers in
$\mathcal{D}_m$ with helpers, which is expressed as

$$ R_m(\mathbf{c}_m, D) = \min_{j} \sum_{v \in \{r_i\} \cup \{h_k\}} \min\{c_{m,s \to v},\; c_{m,v \to t_j}\}. \qquad (5) $$

Furthermore, the algorithm has a running time of $O(|V||E|^2)$.



Proof: Refer to Appendix A. 

Similarly, the achieved rate $R_m(\mathbf{c}_m, D)$ is a concave
function of $\mathbf{c}_m$, as summarized below.

Corollary 2. In the 2-layer sub-graph $\mathcal{D}_m$ with helpers,
the delay-bounded multicast rate $R_m(\mathbf{c}_m, D)$ obtained by
our tree-packing algorithm is a concave function of the overlay link
rates $\mathbf{c}_m$.

Proof: Refer to Appendix B. 

4. OVERLAY LINK RATE CONTROL 
4.1 Considering Both Delay and Loss 

We revise the original formulation to design our link rate control
algorithm with both queuing delay and loss rate taken into account.



Adapting link rates to both delay and loss allows early detection 
and fast response to network dynamics. 

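Consider the following formulation with a penalty term added
into the objective function of the problem MP:

$$ \textbf{MP-EQ:} \quad \max_{\mathbf{c} \ge 0} \; \mathcal{U}(\mathbf{c}) \triangleq \sum_{m=1}^{M} U_m\big(R_m(\mathbf{c}_m, D)\big) - \sum_{l \in \mathcal{L}} \int_0^{\mathbf{a}_l^T \mathbf{y}} q_l(z)\, dz \qquad (6) $$

$$ \text{s.t.} \quad \mathbf{a}_l^T \mathbf{y} \le C_l, \quad \forall l \in \mathcal{L}, \qquad (7) $$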



where $\int_0^{\mathbf{a}_l^T \mathbf{y}} q_l(z)\, dz$ is the penalty
associated with violating the capacity constraint of physical link
$l \in \mathcal{L}$, and we choose the price function to be

$$ q_l(z) \triangleq \frac{(z - C_l)^+}{z}, \qquad (8) $$

where $(a)^+ = \max\{a, 0\}$. If all the constraints are satisfied,
then the second term in (6) vanishes; if instead some constraints are
violated, then we charge some penalty for doing so.
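The penalty term can be evaluated in closed form. The sketch below assumes the price function $q_l(z) = (z - C_l)^+ / z$, which is the form consistent with the loss-rate interpretation used later in this section; under that assumption the integral from 0 to a load $x > C_l$ equals $(x - C_l) - C_l \ln(x / C_l)$ and is zero otherwise. The numerical cross-check and all function names are illustrative only.

```python
import math

def price(z, capacity):
    """q_l(z) = (z - C_l)^+ / z, the fraction of offered traffic above capacity."""
    return max(z - capacity, 0.0) / z if z > 0 else 0.0

def penalty(load, capacity):
    """Closed form of the integral of q_l(z) from 0 to `load`."""
    if load <= capacity:
        return 0.0
    return (load - capacity) - capacity * math.log(load / capacity)

def penalty_numeric(load, capacity, steps=100_000):
    """Midpoint-rule approximation of the same integral, as a cross-check."""
    width = load / steps
    return sum(price((k + 0.5) * width, capacity) for k in range(steps)) * width

within = penalty(8.0, 10.0)       # load below capacity: no penalty
exceeded = penalty(12.0, 10.0)    # load above capacity: positive penalty
```

The penalty is zero up to the capacity and then grows smoothly, which is what allows the penalized objective to remain concave in the link rates.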

Remark: (i) The problem MP-EQ is equivalent to the original
problem MP, because any feasible solution $\mathbf{c}$ of these two
problems must satisfy $\mathbf{a}_l^T \mathbf{y} \le C_l$, and
consequently the penalty term in the problem MP-EQ vanishes.
Therefore, any optimal solution of the original problem MP must be an
optimal solution of the problem MP-EQ and vice versa. (ii) It can be
verified that
$-\sum_{l \in \mathcal{L}} \int_0^{\mathbf{a}_l^T \mathbf{y}} q_l(z)\, dz$
is a concave function in $\mathbf{c}$; hence, $\mathcal{U}(\mathbf{c})$
is a sum of concave functions and is concave. However, because
$R_m(\mathbf{c}_m, D)$ is the minimum min-cut of the overlay graph
$\mathcal{D}_m$ with link rates being $\mathbf{c}_m$,
$\mathcal{U}(\mathbf{c})$ is not a differentiable function [15].

We apply the Lagrange dual approach to design distributed
algorithms for the problem MP-EQ. The advantage of adopting
distributed rate control algorithms in our system is that it allows
robust adaptation to unpredictable network dynamics.

The Lagrange function of the problem is given by:

$$ \mathcal{G}(\mathbf{c}, \mathbf{p}) \triangleq \sum_{m=1}^{M} U_m\big(R_m(\mathbf{c}_m, D)\big) - \sum_{l \in \mathcal{L}} \int_0^{\mathbf{a}_l^T \mathbf{y}} q_l(z)\, dz - \sum_{l \in \mathcal{L}} p_l \big(\mathbf{a}_l^T \mathbf{y} - C_l\big), \qquad (9) $$

where $p_l \ge 0$ is the Lagrange multiplier associated with the
capacity constraint in (7) of physical link $l$. $p_l$ can be
interpreted as the price of using link $l$. Since the problem MP-EQ is
a concave optimization problem with linear constraints, strong duality
holds and there is no duality gap. Any optimal solution of the problem
together with one of its corresponding Lagrange multipliers forms a
saddle point of $\mathcal{G}(\mathbf{c}, \mathbf{p})$ and vice versa.
Thus to solve the problem MP-EQ, it suffices to design algorithms to
pursue saddle points of $\mathcal{G}(\mathbf{c}, \mathbf{p})$.

4.2 A Loss-Delay Based Primal-Subgradient-Dual Algorithm

There are two issues to address in designing algorithms for pursuing
saddle points of $\mathcal{G}(\mathbf{c}, \mathbf{p})$. First, the
utility function $\mathcal{U}(\mathbf{c})$ (and consequently
$\mathcal{G}(\mathbf{c}, \mathbf{p})$) is not everywhere
differentiable. Second, $\mathcal{U}(\mathbf{c})$ (and consequently
$\mathcal{G}(\mathbf{c}, \mathbf{p})$) is not strictly concave in
$\mathbf{c}$, thus distributed algorithms may not converge to the
desired saddle points under multi-party conferencing settings [4].

To address the first concern, we use subgradient in algorithm 
design. To address the second concern, we provide a convergence 
result for our designed algorithm. 

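To proceed, we first compute subgradients of $\mathcal{U}(\mathbf{c})$.
The proposition below presents a useful observation.

Proposition 1. A subgradient of $\mathcal{U}(\mathbf{c})$ with respect
to $c_{m,e}$ for any $e \in E$ and $m = 1, \ldots, M$ is given by

$$ U'_m(R_m) \frac{\partial R_m}{\partial c_{m,e}} - \sum_{l \in \mathcal{L}} a_{l,e} \frac{(\mathbf{a}_l^T \mathbf{y} - C_l)^+}{\mathbf{a}_l^T \mathbf{y}}, $$

where $\frac{\partial R_m}{\partial c_{m,e}}$ is a subgradient of
$R_m(\mathbf{c}_m, D)$ with respect to $c_{m,e}$.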

Proof: Refer to Appendix C. 

Motivated by the pioneering work of Arrow, Hurwicz, and Uzawa
[16] and the follow-up works [17, 18], we propose to use the
following primal-subgradient-dual algorithm to pursue the saddle point
of $\mathcal{G}(\mathbf{c}, \mathbf{p})$: $\forall e \in E$,
$m = 1, \ldots, M$, $\forall l \in \mathcal{L}$,
Primal-Subgradient-Dual Link Rate Control Algorithm:

where $\alpha > 0$ represents a constant step size for all the
iterations, and the function

$$ [b]^+_a = \begin{cases} \max(0, b), & a \le 0; \\ b, & a > 0. \end{cases} $$
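The projection $[b]^+_a$ can be written in a few lines. In the sketch below, `project` follows the definition above, while `step` is a hypothetical helper showing how such a projection is typically combined with a step size; it is an illustration and not the update rule of the algorithm itself.

```python
def project(b, a):
    """[b]^+_a: return b when a > 0, otherwise max(0, b)."""
    return b if a > 0 else max(0.0, b)

def step(value, direction, step_size):
    """Move `value` along `direction`; a variable sitting at zero is not pushed below zero."""
    return value + step_size * project(direction, value)

blocked = step(0.0, -1.0, 0.1)   # at zero with a negative direction: stays at 0.0
free = step(2.0, -1.0, 0.1)      # strictly positive: moves to 1.9
```

Note that with a discrete step a strictly positive variable can still overshoot below zero if the step is large; a practical implementation might additionally clamp the result, which is an assumption beyond what the text states.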



We have the following observations for the control algorithm in 
GOJ-OJ}: 

• It is known that T*ie£ a i,e— L ~r can t> e interpreted as the 

packet loss rate observed at overlay link e 1191 . The intuitive 
explanation is as follows. The term (ajy - C/) + is the excess 

traffic rate offered to physical link /; thus — '—? models the 

ajy 

fraction of traffic that is dropped at /. Assuming the packet 
loss rates are additive (which is a reasonable assumption for 
low packet loss rates), the total packet loss rates seen by the 

overlay link e is given by Y*ie£ a i,e — '— t • 

• It is also known that $p_l$, updated according to (11), can be
interpreted as the queuing delay at physical link $l$ [20].
Intuitively, if the incoming rate $a_l^T y > C_l$ at $l$, then it
introduces an additional queuing delay of
$\frac{a_l^T y - C_l}{C_l}$ for $l$. Otherwise, if
$a_l^T y < C_l$, the present queuing delay is reduced by an
amount of $\frac{C_l - a_l^T y}{C_l}$ unless hitting zero. The total
queuing delay observed by the overlay link $e$ is then given by the
sum $\sum_{l \in \mathcal{L}} a_{l,e}\, p_l$.

• It turns out that the utility function, the subgradients, the packet
loss rate and the queuing delay are sufficient statistics to update
$c_{m,e}$ independently of the updates of other link rates. This
way, we can solve the problem MP-EQ without knowing the
physical network topology and physical link capacities.
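The three observations above can be sketched as a single update step. This is a minimal illustration, assuming one session, a hypothetical 0/1 routing matrix `a[l][e]`, and caller-supplied values for the utility derivative and the subgradient of the session rate; all names are illustrative and not taken from the Celerity prototype. For simplicity the sketch clips updated rates and prices at zero instead of applying the projection $[\cdot]^+$ exactly, and it computes the loss and delay terms from the topology, whereas a deployed system would measure them.

```python
def update_step(c, p, a, cap, u_prime, dR, alpha):
    """One primal-subgradient-dual iteration for a single session.

    c[e]    : current rate on overlay link e
    p[l]    : current price (queuing delay) of physical link l
    a[l][e] : 1 if overlay link e traverses physical link l, else 0
    cap[l]  : capacity C_l of physical link l
    u_prime : derivative of the utility at the current session rate
    dR[e]   : subgradient of the session rate w.r.t. c[e]
    alpha   : constant step size of the rate update
    """
    num_links, num_overlay = len(cap), len(c)
    # Aggregate traffic offered to each physical link (a_l^T y).
    load = [sum(a[l][e] * c[e] for e in range(num_overlay))
            for l in range(num_links)]
    # Fraction of the offered traffic dropped at each physical link.
    drop = [max(0.0, load[l] - cap[l]) / load[l] if load[l] > 0 else 0.0
            for l in range(num_links)]

    new_c = []
    for e in range(num_overlay):
        loss = sum(a[l][e] * drop[l] for l in range(num_links))   # loss seen by e
        delay = sum(a[l][e] * p[l] for l in range(num_links))     # delay seen by e
        direction = u_prime * dR[e] - loss - delay
        new_c.append(max(0.0, c[e] + alpha * direction))

    # Price update with per-link step size 1 / C_l, clipped at zero.
    new_p = [max(0.0, p[l] + (load[l] - cap[l]) / cap[l])
             for l in range(num_links)]
    return new_c, new_p
```

Note that each overlay link only needs its own loss and delay terms, which is what lets the receiving peer of a link update the rate locally.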

The algorithm in (10)-(11) is similar to the standard primal-dual
algorithm, but since $\mathcal{U}(c)$ is not differentiable everywhere,
we use subgradients instead of gradients in updating the overlay link
rates $c$. If we fix the dual variables $p$, then the algorithm in (10)
corresponds to the standard subgradient method [21]. It maximizes a
non-differentiable function in a way similar to gradient methods for
differentiable functions: in each step, the variables are updated in
the direction of a subgradient. However, such a direction may not
be an ascent direction; instead, the subgradient method relies on
a different property. If the variable takes a sufficiently small step
along the direction of a subgradient, then the new point is closer to
the set of optimal solutions.
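The property in the last two sentences can be checked numerically. The sketch below uses a hypothetical two-variable concave function (not one from this paper) whose maximizer is the origin: a step along a valid subgradient lowers the function value, yet moves the point closer to the maximizer.

```python
import math


def f(x):
    """Concave, non-differentiable objective with its maximizer at the origin."""
    return -(abs(x[0]) + 2.0 * abs(x[1]))


def step(x, g, t):
    """Move from x along direction g with step size t."""
    return (x[0] + t * g[0], x[1] + t * g[1])


x = (1.0, 0.0)
# At x the term -2|x2| is not differentiable: (-1, g2) is a subgradient of
# the concave function f for any g2 in [-2, 2]. We pick g2 = -2.
g = (-1.0, -2.0)
x_new = step(x, g, 0.1)

value_before, value_after = f(x), f(x_new)   # the function value gets worse
dist_before = math.hypot(*x)                 # distance to the maximizer
dist_after = math.hypot(*x_new)              # the distance still shrinks
```

Here `value_after` is smaller than `value_before`, so the chosen subgradient is not an ascent direction, while `dist_after` is smaller than `dist_before`.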

Establishing convergence of subgradient algorithms for saddle-point
optimization is in general challenging [17]. We explore convergence
properties for our primal-subgradient-dual algorithm in the
following theorem.

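Theorem 3. Let $(c^*, p^*)$ be a saddle point of $\mathcal{G}(c, p)$,
and $\bar{\mathcal{G}}^{(k)}$ be the average function value obtained by
the algorithm in (10)-(11) after $k$ iterations:

$$\bar{\mathcal{G}}^{(k)} \triangleq \frac{1}{k} \sum_{i=0}^{k-1} \mathcal{G}\big(c^{(i)}, p^{(i)}\big).$$

Suppose $|U'_m(R_m(c_m))| \le U$, $\forall m = 1, \ldots, M$, where $U$
is a constant, then we have

$$-\frac{B_1}{2 \alpha k} - \frac{\alpha \Delta^2}{2} \le \bar{\mathcal{G}}^{(k)} - \mathcal{G}(c^*, p^*) \le \frac{B_2}{2k} + \frac{\Delta^2}{2} \max_{l \in \mathcal{L}} C_l^{-1},$$

where $B_1 = \|c^{(0)} - c^*\|^2$ and
$B_2 = [p^{(0)} - p^*]^T \mathrm{diag}(C_l, l \in \mathcal{L})\, [p^{(0)} - p^*]$
are two positive distances depending on $(c^{(0)}, p^{(0)})$, and
$\Delta$ is a positive constant depending on $U$ and
$(c^{(0)}, p^{(0)})$.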

Proof: Refer to Appendix D. 

Remarks: (i) The results bound the gap between the time-averaged
Lagrange function value obtained by the algorithm and the optimal
value in terms of the distances of the initial iterates
$(c^{(0)}, p^{(0)})$ to a saddle point. In particular, the averaged
function values $\bar{\mathcal{G}}^{(k)}$ converge to the saddle point
value $\mathcal{G}(c^*, p^*)$ within a gap of
$\max\left(\alpha, \max_{l \in \mathcal{L}} C_l^{-1}\right) \frac{\Delta^2}{2}$,
at a rate of $1/k$. (ii) The requirement on the utility function is
easy to satisfy; one example is $U_m(z) = \log(z + \epsilon)$ with
$\epsilon > 0$. (iii) Our results generalize the one in [17] in the
sense that the one in [17] only applies to the case of uniform step
size, while we allow different $p_l$ to update with different step
sizes $\frac{1}{C_l}$, which is critical for $p_l$ to be interpreted
as queuing delay and thus practically measurable. Our results also
have a less stringent requirement on the utility function than the one
in [17]. (iv) Although the results may not guarantee convergence
in the strict sense, our experiments over the LAN testbed and on the
Internet in Section 6 show that the algorithm quickly stabilizes around
optimal operating points. Obtaining stronger convergence results
that confirm our practical observations is of great interest and is
left for future work.

4.3 Computing Subgradients of $R_m(c_m, D)$

A key to implementing the Primal-Subgradient-Dual algorithm
is to obtain subgradients of $R_m(c_m, D)$. We first present some
preliminaries on subgradients, as well as concepts for computing
subgradients of $R_m(c_m, D)$.

Definition 1. Given a convex function $f$, a vector $\xi$ is said to
be a subgradient of $f$ at $x \in \mathrm{dom} f$ if

$$f(x') \ge f(x) + \xi^T (x' - x), \quad \forall x' \in \mathrm{dom} f,$$

where $\mathrm{dom} f = \{x \in \mathbb{R}^n \mid |f(x)| < \infty\}$
represents the domain of the function $f$.

For a concave function $f$, $-f$ is a convex function. A vector $\xi$
is said to be a subgradient of $f$ at $x$ if $-\xi$ is a subgradient
of $-f$.

Next, we define the notion of a critical cut. For session $m$, let
its source be $s_m$, and receiver set be $V_m \subset V - \{s_m\}$. A
partition of the vertex set, $V = Z \cup \bar{Z}$ with $s_m \in Z$ and
$t \in \bar{Z}$ for some $t \in V_m$, determines an $s_m$-$t$-cut.
Define

$$\delta(Z) \triangleq \{(i, j) \in E \mid i \in Z,\ j \in \bar{Z}\}$$

to be the set of overlay links originating from nodes in set $Z$ and
going into nodes in set $\bar{Z}$. Define the capacity of cut
$(Z, \bar{Z})$ as the sum capacity of the links in $\delta(Z)$:

$$\rho(Z) \triangleq \sum_{e \in \delta(Z)} c_{m,e}.$$

Definition 2. For session $m$, a cut $(Z, \bar{Z})$ is an
$s_m$-$V_m$ critical cut if it separates $s_m$ and any of its
receivers and $\rho(Z) = R_m(c_m, D)$.

Figure 5: Critical cut example. Source $s$ and its two receivers
$t_1, t_2$ are connected over a directed graph. The number associated
with a link represents its link capacity.

We show an example to illustrate the concept of a critical cut.
In Fig. 5, $s$ is a source, and $t_1$, $t_2$ are its two receivers. The
minimum of the min-cuts among the receivers is 2. For the cut
$(\{s, h_1, h_2, t_1\}, \{t_2\})$, its $\delta(\{s, h_1, h_2, t_1\})$
contains links $(h_1, t_2)$ and $(h_2, t_2)$, each having capacity one.
Thus the cut $(\{s, h_1, h_2, t_1\}, \{t_2\})$ has a capacity of 2 and
it is an $s$-$(t_1, t_2)$ critical cut.

With the necessary preliminaries, we turn to computing subgradients
of $R_m(c_m, D)$. Since $R_m(c_m, D)$ is the minimum min-cut of $s_m$
and its receivers over the overlay graph $\mathcal{D}_m$, it is known
that one of its subgradients can be computed in the following way [15].

• Find an $s_m$-$V_m$ critical cut for session $m$, and denote it as
$(Z, \bar{Z})$. Note there can be multiple $s_m$-$V_m$ critical cuts
in graph $\mathcal{D}_m$, and it is sufficient to find any one of them.

• A subgradient of $R_m(c_m, D)$ with respect to $c_{m,e}$ is given by

$$\frac{\partial R_m(c_m, D)}{\partial c_{m,e}} = \begin{cases} 1, & \text{if } e \in \delta(Z); \\ 0, & \text{otherwise.} \end{cases} \quad (12)$$



In our system, these subgradients are computed by the source of
each session, after collecting the overlay-link rates from each
receiver in the session. More implementation details are in Section 5.
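As an illustration of the cut definitions and the subgradient rule in this section, the sketch below computes the crossing links, the cut capacity, and the 0/1 subgradient for a given source-side set. The link rates are hypothetical, since Fig. 5 is not reproduced here; only the two unit-rate links into `t2` follow the example in the text.

```python
def cut_links(rates, Z):
    """Return delta(Z): links (i, j) with i inside Z and j outside Z."""
    return [(i, j) for (i, j) in rates if i in Z and j not in Z]


def cut_capacity(rates, Z):
    """Return rho(Z): the total rate of the links in delta(Z)."""
    return sum(rates[e] for e in cut_links(rates, Z))


def subgradient_from_cut(rates, Z):
    """Assign 1 to links crossing the given critical cut and 0 to all others."""
    crossing = set(cut_links(rates, Z))
    return {e: (1 if e in crossing else 0) for e in rates}


# Hypothetical rates on a small directed graph with source s, helpers
# h1 and h2, and receivers t1 and t2.
rates = {
    ("s", "h1"): 2, ("s", "h2"): 2,
    ("h1", "t1"): 2, ("h2", "t1"): 1,
    ("h1", "t2"): 1, ("h2", "t2"): 1,
}
Z = {"s", "h1", "h2", "t1"}

capacity = cut_capacity(rates, Z)
subgradient = subgradient_from_cut(rates, Z)
```

With these rates the cut has capacity 2, and only `("h1", "t2")` and `("h2", "t2")` receive a subgradient entry of 1, so only those two links would be credited with the utility term in the rate update.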

5. PRACTICAL IMPLEMENTATION 

Using the asynchronous networking paradigm supported by the
asynchronous I/O library (called asio) in the Boost C++ library,
we have implemented a prototype of Celerity, our proposed
multi-party conferencing system, with about 17,000 lines of code in C++.

Celerity consists of three main logic components: a link rate control
module, a tree-packing and critical cut calculation module, and a
data multicast engine. Fig. 6 describes the relationship between
these components and where they physically reside. Pseudocode for
link rate control and data multicast is given in Algorithm 1 and
Algorithm 2, respectively, in the Appendix.

In the following, we describe the functionality implemented by
peers, some critical implementation details, and the operation overhead.

Figure 6: System architecture of Celerity.

5.1 Peer Functionality 

In our Celerity implementation, all peers perform the following
functions:

• Peers in broadcast trees forward packets received from their
upstream parent to their downstream children. Sufficient
information about downstream children in the tree is embedded
in the packet header, for a packet to become "self-routing"
from the source to all leaf nodes in a tree.

• Every 200 ms, each peer calculates the loss rate and queuing 
delay of its incoming links and adjusts the rates of its incom- 
ing links based on the link rate control algorithm, and then 
sends them to their corresponding upstream senders for the 
new rates to take effect. 

• Every 300 ms, each peer sends the link state (including allo- 
cated rate and Round Trip Time) of all its outgoing links for 
each session to the source of the session. 

Upon receiving link states for all the links, the source of each
session uses the received link rates and the delay information to pack
a new set of delay-bounded trees, and starts transmitting session
packets along these trees. We set the delay bound to be 200 ms
when packing delay-bounded trees in our implementation. When
a source packs delay-bounded trees, it also calculates one critical
cut and the derivative of the utility for its session based on the
allocated link rates and the delay information. In addition, the source
embeds the information about the critical cut and the derivative of
the utility in the header of outgoing packets. When these packets
are received, a peer learns the derivative of the utility and whether
a link belongs to the critical cut or not; it then adjusts the link rate
accordingly.

In the following, we use the example in Fig. 1 to further explain
how Celerity works.

For an overlay link $e \in E$, say B → C, the tail node C is
responsible for controlling $c_{A,e}$, the rate allocated to session A.
To do so, C works with the head node B to measure the packet loss
rate and queuing delay experienced by session A's packets over $e$
(B → C). This can be done by B attaching local sequence numbers
and timestamps to session A's packets, and C calculating the missing
sequence numbers and the one-way delay based on the timestamps [4].
C also receives other needed control-plane information
from the source of session A, such as the critical cut information
and the derivative of the utility, along with the data packets arriving
at C. With the loss rate and queuing delay for session A's packets,
as well as this control-plane information, C adjusts the allocated
rate $c_{A,B \to C}$ using the algorithm in (10)-(11) and sends it to B
for the new rate to take effect.
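The per-link measurement described above can be sketched as follows. This is an illustrative sketch only: the tuple layout and function name are hypothetical, the sender and receiver clocks are assumed to be comparable, and the extraction of the queuing component from the one-way delay is left out.

```python
def link_stats(received, first_seq, last_seq):
    """Estimate loss rate and mean one-way delay over one measurement interval.

    received  : list of (seq, send_ts, recv_ts) tuples seen by the receiving peer
    first_seq : first local sequence number used by the sender in the interval
    last_seq  : last local sequence number used by the sender in the interval
    """
    expected = last_seq - first_seq + 1
    in_window = [(seq, send, recv) for seq, send, recv in received
                 if first_seq <= seq <= last_seq]
    # Missing sequence numbers are counted as lost packets.
    seen = {seq for seq, _, _ in in_window}
    loss_rate = (expected - len(seen)) / expected if expected > 0 else 0.0
    # One-way delay is the receive time minus the sender's timestamp.
    delays = [recv - send for _, send, recv in in_window]
    mean_delay = sum(delays) / len(delays) if delays else 0.0
    return loss_rate, mean_delay
```

For example, if the sender used sequence numbers 10 to 13 and the receiver saw 10, 11 and 13, the estimated loss rate for the interval is 0.25.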

Every 300 ms, the head node of each overlay link $e$ reports the
allocated rates $c_{m,e}$ and the overlay link round-trip-time
information to the source peers. Take the overlay link B → C for
example: B reports the allocated rate $c_{A,B \to C}$ and the
round-trip-time information of this link to source A. With the
collected link state information, source peer A packs delay-bounded
trees using the algorithm described in Section 3, calculates critical
cuts using the method explained in Section 4.3 as well as the
derivative of the utility, and then delivers data and the
control-plane information to the peers along the trees.

5.2 Critical Cut Calculation 

The calculation of critical cuts, i.e., the subgradient of
$R_m(c_m, D)$, is the key to our implementation of the
primal-subgradient-dual algorithm. There can be multiple critical cuts
in one session, but it is sufficient to find any one of them. Since the
source collects the allocated rates of all overlay links in its own
session, it can calculate the min-cut from the source to every
receiver, and record the cut that achieves the min-cut. Then, the
source compares the capacities of these min-cuts, and the cut with the
smallest capacity is a critical cut.
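The procedure in this subsection can be sketched with a textbook max-flow routine. The code below uses the Edmonds-Karp algorithm as an assumption; the paper does not say which min-cut algorithm the prototype uses, and the function names and example rates are hypothetical.

```python
from collections import deque


def min_cut(rates, source, sink):
    """Edmonds-Karp max-flow; returns (cut value, source-side node set Z)."""
    residual = {source: {}, sink: {}}
    for (u, v), c in rates.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v].setdefault(u, 0)
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            # Nodes still reachable from the source form the source side Z.
            return flow, set(parent)
        # Walk back from the sink, find the bottleneck, and push flow.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][w] for u, w in path)
        for u, w in path:
            residual[u][w] -= push
            residual[w][u] += push
        flow += push


def critical_cut(rates, source, receivers):
    """Return (capacity, Z) of a min-cut with the smallest value over all receivers."""
    return min((min_cut(rates, source, t) for t in receivers),
               key=lambda result: result[0])
```

On a small hypothetical graph with rates `{("s","h1"): 2, ("s","h2"): 2, ("h1","t1"): 2, ("h2","t1"): 1, ("h1","t2"): 1, ("h2","t2"): 1}`, calling `critical_cut(rates, "s", ["t1", "t2"])` returns capacity 2 with source side `{"s", "h1", "h2", "t1"}`.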

5.3 Utility Function 

With respect to the utility function in our prototype implementation,
the PSNR (peak signal-to-noise ratio) metric is the de facto
standard criterion to provide objective quality evaluation in video
processing. We observed that the PSNR of a video stream coded at
a rate $z$ can be approximated by a logarithmic function
$B \log(z + \delta)$, in which a higher $B$ represents videos with a
larger amount of motion. Here $\delta$ is a small positive constant
that ensures the function has a bounded derivative for $z \ge 0$.
Based on this observation, we use a logarithmic utility function in
our implementation.
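The bounded-derivative property can be verified in one line. The following short derivation assumes the natural logarithm and writes $U$ for the utility of a single session:

```latex
\[
U(z) = B \log(z + \delta), \qquad
U'(z) = \frac{B}{z + \delta} \le \frac{B}{\delta}
\quad \text{for all } z \ge 0 .
\]
```

The bound $B/\delta$ is the kind of uniform bound on $U'$ that the convergence result in Section 4 asks of the utility function; without $\delta$ the derivative would be unbounded as $z \to 0$.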

5.4 Opportunistic Local Loss Recovery 

Providing effective loss recovery in a delay-bounded reliable
broadcast scenario, such as multi-party conferencing, is known to be
challenging [22]. It is hard for error control coding to work
efficiently, since different receivers in a session may experience
different loss rates, and thus choosing proper error control coding
parameters to avoid unnecessary waste of throughput is non-trivial.
Re-broadcasting lost packets, on the other hand, introduces additional
delay and may cause packets to miss their deadlines and become useless.

In our implementation, we use network coding [22], [23] to allow
flexible and opportunistic local loss recovery. For each overlay link
$e$, if the trees of a session $m$ do not exhaust $c_{m,e}$, the
overlay-link rate dedicated to the session, then we send coded packets
(i.e., linear combinations of received packets of the corresponding
session) over link $e$. As such, the receiver of overlay link $e$ can
recover the packets that are lost on link $e$ locally by using the
network-coded packets. This way, Celerity provides a flexible local
loss recovery capability without incurring delay due to retransmission.
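To make the idea of recovering a loss from a coded packet concrete, here is a deliberately simplified sketch. It assumes the simplest possible linear combination, a byte-wise XOR over equal-length packets, which can repair one loss per group; the paper does not specify the field or coding scheme used on each link, so this is an illustration of the principle rather than of Celerity's actual coding.

```python
def xor_packets(packets):
    """Combine equal-length packets byte by byte with XOR."""
    out = bytearray(len(packets[0]))
    for pkt in packets:
        for i, byte in enumerate(pkt):
            out[i] ^= byte
    return bytes(out)


def recover_lost(received, coded):
    """Recover the single missing packet of a group.

    received : the packets of the group that did arrive
    coded    : the XOR of all packets in the group, sent in spare link capacity
    """
    return xor_packets(list(received) + [coded])
```

For instance, with a group of three packets, the receiver can rebuild the second packet from the first, the third, and the coded packet, without asking the sender to retransmit.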

5.5 Fast Bootstrapping 

Similar to TCP's Slow Start strategy, we implement a method
in Celerity called "quick start" to quickly ramp up the rates of all
sessions during the conference initialization stage. The purpose is to
quickly bootstrap the system to close-to-optimal operating points
when the conference just starts, during which period peers are
joining the conference and nothing significant is going on. We achieve
this by using larger values for $B$ in the utility functions and a
large step size in link rate adaptation during the first 30 seconds.
After the initialization stage, we reset $B$ and the step sizes to
proper values and allow our system to converge gradually, avoiding
unnecessary performance fluctuation.

5.6 Operation Overhead 

There are two types of overhead in Celerity. (1) Packet overhead:
the size of the application-layer packet header is around 46 bytes
per data packet, including critical cut information, the derivative of
the utility, packet sequence number, coding vector, timestamp and
so on. (2) Link-rate control and link-state report overhead: every
200 ms, each peer adjusts the rates of its incoming links and sends
them to their corresponding upstream senders. In our implementation,
such rate-control overhead is 0.2 kbps per link per session.
For the link-state report overhead, each peer sends the link state of
all its outgoing links for each session to the source of the session
every 300 ms. In our implementation, for each peer, such link-state
report overhead is 0.158 kbps per link per session. In Section 6.2,
we report an overall operational overhead of 3.9% in our 4-party
Internet experiment.
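As a rough consistency check on the control-plane figures, the quoted rates and reporting intervals imply the following average message sizes per link per session. This is simple arithmetic on the numbers above, assuming 1 kbps = 1000 bits per second; it is not a statement about the actual message formats.

```latex
\begin{align*}
\text{rate update:} \quad & 0.2~\text{kbps} \times 0.2~\text{s} = 40~\text{bits} = 5~\text{bytes}, \\
\text{link-state report:} \quad & 0.158~\text{kbps} \times 0.3~\text{s} = 47.4~\text{bits} \approx 5.9~\text{bytes}.
\end{align*}
```

Both are small compared with the 46-byte header carried by every data packet, which is consistent with packet overhead dominating the total.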

6. EXPERIMENTS 

We evaluate our prototype Celerity system over a LAN testbed
as well as over the Internet. The LAN experiments allow us to
(i) stress-test Celerity under various network conditions; (ii) see
whether Celerity meets the design goal of delivering high
delay-bounded throughput and automatically adapting to dynamics in the
network; (iii) demonstrate the fundamental performance gains over
existing solutions, thus justifying our theory-inspired design.

The Internet experiments allow us to further assess Celerity's
superior performance over existing solutions in the real world.

6.1 LAN Testbed Experiments 

We evaluate Celerity over a LAN testbed illustrated in Fig. 7,
where four PC nodes (A, B, C, D) are connected over a LAN dumbbell
topology. The dumbbell topology represents a popular scenario
of multi-party conferencing between branch offices. It is also
a "tough" topology: existing approaches, such as Simulcast and
Mutualcast, fail to efficiently utilize the bottleneck bandwidth and
optimize system performance.




Figure 7: The "tough" dumbbell topology of the experimental
testbed. Two conference participating nodes A and B are in one
"office" and another two nodes C and D are in a different "office".
The two "offices" are connected by directed links between gateway
nodes E and F, each link having a capacity of 480 kbps. Link
propagation delays are negligible.

In our experiments, all four nodes run Celerity. We run a four-party
conference for 1000 seconds and evaluate the system performance.
In order to evaluate Celerity's performance in the presence
of network dynamics, we introduce cross traffic and link failures
during the experiment. In particular, we introduce 80 kbps of
cross traffic from node E to node F between the 300th second
and the 500th second, reducing the available bandwidth
between E and F from 480 kbps to 400 kbps. Further, starting from
the 700th second, we disconnect the physical link between A and
E; this corresponds to a practical situation where node A suddenly
cannot directly communicate with nodes outside the "office" due to
middleware or configuration errors at the gateway E.

Figs. 9a-9d show the sending rate of each session (one session
originates from one node to all other three nodes). For comparison,
we also show the maximum achievable rates by Simulcast and
Mutualcast, as well as the optimal sending rate of each session
calculated by solving the problem in (2)-(3) using a central solver.
Fig. 9e shows the utility obtained by Celerity and its comparison to
the optimal. Fig. 9f shows the average end-to-end delay and packet
loss rate of session A. Delay and loss performance of other sessions
are similar to those of session A.

In the following, we explain the results according to three
different experiment stages.

Figure 8: Session A's trees used by Celerity (upon convergence),
Mutualcast and Simulcast in the dumbbell topology, in the absence
of network dynamics.

6.1.1 Absence of Network Dynamics 

We first look at the first 300 seconds when there is no cross traffic
or link failure. In this time period, the experimental settings are
symmetric for all participating peers; thus the optimal sending rate
for each session is 240 kbps.

As seen in Figs. 9a-9d, Celerity demonstrates fast convergence:
the sending rate of each session quickly ramps up to 95% of the
optimal within 50 seconds. Fig. 9e shows that Celerity quickly
achieves a close-to-optimal utility. These observations indicate that
any other solution can at most outperform Celerity by a small margin.

As a comparison, we also plot the theoretical maximum rates
achievable by Simulcast and Mutualcast in Figs. 9a-9d. We observe
that within 20 seconds, our system already outperforms the
maximum rates of Simulcast and Mutualcast.

Upon convergence, Celerity achieves sending rates that nearly
double the maximum rate achievable by Simulcast and Mutualcast.
This significant gain arises because Celerity can utilize the
bottleneck resource more efficiently, as explained below.

In Fig. 8, we show the trees for session A that are used by Celerity,
Mutualcast and Simulcast in the dumbbell topology. As seen,
Simulcast and Mutualcast only explore 2-hop trees satisfying a
certain structure, limiting their capability of utilizing network
capacity efficiently. In particular, their trees consume the bottleneck
link resource twice; thus, delivering one bit of information consumes
two bits of bottleneck link capacity. For instance, the tree used by
Simulcast has two branches A → C and A → D passing through
the bottleneck links between E and F, consuming twice the critical
resource. Consequently, the maximum achievable rates of Simulcast
and Mutualcast are all 120 kbps. In contrast, Celerity explores
all 2-hop delay-bounded trees, and upon convergence utilizes the
trees that only consume bottleneck link bandwidth once, achieving
rates that are close to the optimal of 240 kbps.
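The 120 kbps and 240 kbps figures follow from simple bottleneck accounting. The sketch below assumes, in line with the symmetric setting, that the two sessions originating in one "office" share the 480 kbps directed bottleneck link equally; the function name and parameters are illustrative.

```python
def max_session_rate(bottleneck_kbps, sessions_sharing, crossings_per_tree):
    """Equal-share session rate when every tree of every sharing session
    crosses the bottleneck link `crossings_per_tree` times."""
    return bottleneck_kbps / (sessions_sharing * crossings_per_tree)


# Simulcast and Mutualcast trees cross the E-F bottleneck twice.
two_crossing_rate = max_session_rate(480, 2, 2)
# Trees that cross the bottleneck only once, as Celerity converges to.
one_crossing_rate = max_session_rate(480, 2, 1)
```

Halving the number of bottleneck crossings per tree doubles the achievable session rate, which is the gain observed in the experiment.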

Fig. 9f shows the average end-to-end delay and packet loss rate
of session A. As seen, the packet loss rate and delay are high
initially, but decrease and stabilize at small values afterwards.
The initial high loss rate arises because Celerity increases the
sending rates aggressively during the conference initialization stage,
in order to bootstrap the conference and explore network resource
limits. Celerity quickly learns and adapts to the network topology,
ending up using cost-effective trees to deliver data. After the
initialization stage, Celerity adapts and converges gradually,
avoiding unnecessary performance fluctuation that deteriorates user
experience. By adapting to both delay and loss, we achieve a low loss
rate upon convergence as compared to the case when only loss is
taken into account [24].

6.1.2 Cross Traffic 

Between the 300th second and the 500th second, we introduce
80 kbps of cross traffic from node E to node F. Consequently, the
available bottleneck bandwidth between E and F decreases from
480 kbps to 400 kbps. We calculate the optimal sending rates during
this time period to be 200 kbps for sessions A and B, while they
remain 240 kbps for sessions C and D.

As seen in Figs. 9a-9d, Celerity quickly adapts to the bottleneck
bandwidth reduction. Celerity's adaptation is expected from its
design, which infers the available resource from loss and delay and
adapts accordingly. From Fig. 9f, we can see a spike in session A's
packet loss rate around the 300th second, at which time the available
bottleneck bandwidth is reduced. The link rate control module in
Celerity senses this increased loss rate, adjusts, and reports the
reduced (overlay) link rates to node A. Upon receiving the reports, the
tree-packing module in Celerity adjusts the source sending rate
accordingly, adapting the system to a new close-to-optimal operating
point. At the 500th second, the cross traffic is removed and the
available bottleneck bandwidth between E and F is restored to 480 kbps.
Celerity also quickly learns this change and adapts to operate at the
original point, as is evident in Figs. 9a-9b.

6.1.3 Link Failure 

Between the 700th second and the 1000th second, we disconnect
the physical link between A and E. Consequently, node A cannot
use the 2-hop trees with node C (D) as the intermediate node;
similarly, node C (D) cannot use the 2-hop trees with node A as
the intermediate node. They can, however, still use the trees with
node B as the intermediate node. We compute the theoretical optimal
sending rates during this time period to be 240 kbps for all sessions.

We observe from Fig. 9a that node A's sending rate first drops
immediately upon link failure, then quickly adapts to the new
operating point of around 120 kbps, only half of the theoretical
optimal. This is because Celerity only explores 2-hop trees for content
delivery, while in this case 3-hop trees (e.g., A → B → C → D) are
needed to achieve the optimal. It is of great interest to explore
source rate control mechanisms beyond this 2-hop tree-packing
limitation to further improve the performance without incurring
excessive overhead.

In Fig. 9d, we observe that the sending rate of session D first drops
and then climbs back. This is because session D happens to use
the trees with node A as the intermediate node right before the link
failure. The link failure breaks session D's trees, thus session D's
rate drops dramatically. Celerity detects the significant change and
adapts to use the trees with B as the intermediate node for session
D. Session D's rate thus gradually recovers to around the optimal.
These observations show the excellent adaptability of Celerity to
abrupt network condition changes.

As a comparison, we observe that Simulcast's maximum achievable
rates of sessions A, C, and D all drop to zero upon the link
failure. This is because there is no direct overlay link between A
and C (D) after the link failure. Consequently, Simulcast is not
able to broadcast the source's content to all the receivers in these
sessions, resulting in zero session rates.

6.2 Internet Experiments 

Besides the prototype Celerity system, we also implement
prototype systems of Simulcast and Mutualcast. Both
Celerity and Mutualcast use the same log utility functions in their
rate control modules. We evaluate the performance of these systems
in a four-party conferencing scenario over the Internet.

We use four PC nodes that span two continents and three countries
to form the conferencing scenario. Two of the PC nodes are
in Hong Kong, one is in Redmond, Washington, US, and the last
one is in Toronto, Canada. This setting represents a common global
multi-party conferencing scenario.

We run multiple 15-minute four-party conferences using the
prototype systems, in a one-by-one and interleaving manner. We select
one representative run for each system, and summarize their
performance in Fig. 10.

Figs. 10a-10d show the rate performance of each session. (Recall
that a session originates from one node to all other three nodes.)
As seen, all the session rates in Celerity quickly ramp up to
near-stable values within 50 seconds, and outperform Simulcast within
10 seconds. Upon stabilization, Celerity achieves the best throughput
performance among the three systems and Simulcast is the
worst. For instance, all the session rates in Celerity are 2x those
in Simulcast and Mutualcast, except in session C where Mutualcast
is able to achieve a higher rate than Celerity.

We further observe Celerity's superior performance in Fig. 10e,
which shows the aggregate session rates, and in Fig. 10f, which
shows the total achieved utilities. In both statistics, Celerity
outperforms the other two systems by a significant margin.
Specifically, the aggregate session rate achieved by Celerity is 2.5x
that achieved by Simulcast, and is 1.8x that achieved by Mutualcast.

These results show that our theory-inspired Celerity solution can
allocate the available network resource to best optimize the system
performance. Mutualcast aims at a similar objective but only works
best in scenarios where bandwidth bottlenecks reside only at
the edge of the network [4].

Figs. 10g-10i show the average end-to-end loss rate and delay
from source to receivers for session A, session C and session D.
The results for session B are very similar to those of session A and
are not included here. As seen, the average end-to-end delays of all
sessions are within 200 ms, which is our preset delay bound for
effective interactive conferencing experience. The average end-to-end
loss rates for all sessions are at most 1%-2% upon system stabilization.

The overall operation overhead of Celerity in the 4-party Inter- 
net experiment is around 3.9%. In particular, the packet overhead 
accounts for 3.4%, and the link-rate control and link-state report 
overhead is around 0.5%. 

7. CONCLUDING REMARKS 

With the proliferation of front-facing cameras on mobile devices,
multi-party video conferencing will soon become a utility that
both businesses and consumers would find useful. With Celerity,
we attempt to bridge the long-standing gap between the bit rate
of a video source and the highest possible delay-bounded
broadcasting rate that can be accommodated by the Internet, where the
bandwidth bottlenecks can be anywhere in the network. This paper
presents the Celerity solution as a first step in making this vision a
reality: by combining a polynomial-time tree packing algorithm at the
source and adaptive rate control along each overlay link, we are
able to maximize the source rates without any a priori knowledge
of the underlying physical topology in the Internet. Celerity has
been implemented in a prototype system, and extensive experimental
results in a "tough" dumbbell LAN testbed and on the Internet
demonstrate Celerity's superior performance over the state-of-the-art
solutions Simulcast and Mutualcast.

As future work, we plan to explore source rate control mecha- 
nisms beyond the 2-hop tree-packing limitation in Celerity to fur- 
ther improve its performance without incurring excessive overhead. 

APPENDIX 

A. Proof of Theorem 2 

Proof: Firstly, we prove that the minimum of the min-cuts separating
the source and receivers in $\mathcal{D}_m$ can be expressed as

$$\min_{t_j \in T} \sum_{e} \min\{c_{m,s \to e},\ c_{m,e \to t_j}\},$$

where the sum runs over the intermediate nodes $e$.

In the overlay graph $\mathcal{D}_m$, the minimum of the min-cuts is
$\min_{t_j \in T} \mathrm{MinCut}(s, t_j)$, where $T$ is the set of
receivers, and $\mathrm{MinCut}(s, t_j)$ is the min-cut separating the
source $s$ and receiver $t_j$. The min-cut separating the source and a
receiver can be obtained by finding the maximum number of
unit-capacity edge-disjoint paths from the source to the receiver. The
structure of the graph $\mathcal{D}_m$ is special enough that, for each
receiver $t_j$, we can compute the maximum number of edge-disjoint
paths from $s$ to $t_j$ easily.

Figure 9: Performance of Celerity over a dumbbell LAN testbed. (a)-(d): Sending rates and receiving rates of individual sessions. (e): Utility
value achieved compared to the optimum. (f): End-to-end delay and loss rate of session A.

In the graph $\mathcal{D}_m$, we represent each edge with capacity $m$
by $m$ parallel edges, each with unit capacity. For each receiver
node, say $t_j$, due to the special structure of the graph, we can find
these edge-disjoint paths in a very simple way. Since there are only
2-hop paths in the graph $\mathcal{D}_m$, a path from $s$ to $t_j$ must
go through one of the intermediate nodes. Thus for each intermediate
node, say $e$, we can find $\min\{c_{m,s \to e},\ c_{m,e \to t_j}\}$
edge-disjoint paths from $s$ to $e$ and then to $t_j$. Therefore, we
have

$$\mathrm{MinCut}(s, t_j) = \sum_{e} \min\{c_{m,s \to e},\ c_{m,e \to t_j}\}.$$

Consequently, the minimum of the min-cuts separating the source
and receivers can be expressed as

$$\min_{t_j \in T} \sum_{e} \min\{c_{m,s \to e},\ c_{m,e \to t_j}\}.$$

Next, we prove that the tree packing algorithm can achieve the
minimum of the min-cuts separating the source and receivers in the
two-layer graph $\mathcal{D}_m$. This tree packing algorithm is
developed based on Lovász's constructive proof [14] of Edmonds'
Theorem [25]. To proceed, we first apply Lovász's constructive proof to
our two-layer graph $\mathcal{D}_m$; based on the proof, we directly
obtain the tree packing algorithm.
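The per-receiver min-cut expression derived in the first part of this proof is easy to evaluate directly. The sketch below is illustrative: the dictionary layout, the function names, and the example rates are hypothetical.

```python
def two_hop_min_cut(c_in, c_out, receiver):
    """Min-cut from the source to `receiver` when every path has two hops.

    c_in[e]     : rate on the link from the source to intermediate node e
    c_out[e][t] : rate on the link from intermediate node e to receiver t
    """
    return sum(min(c_in[e], c_out.get(e, {}).get(receiver, 0)) for e in c_in)


def session_rate(c_in, c_out, receivers):
    """Minimum of the per-receiver min-cuts."""
    return min(two_hop_min_cut(c_in, c_out, t) for t in receivers)


# Hypothetical rates with two intermediate nodes and two receivers.
c_in = {"h1": 2, "h2": 2}
c_out = {"h1": {"t1": 2, "t2": 1}, "h2": {"t1": 1, "t2": 1}}
rate = session_rate(c_in, c_out, ["t1", "t2"])
```

Here receiver `t1` has a min-cut of 3 and receiver `t2` has a min-cut of 2, so the minimum of the min-cuts is 2. No general max-flow computation is needed because each intermediate node contributes independently.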

Notations: Let $G$ be a digraph with a source $a$. We assume
all edges have unit capacity and allow multiple edges for each
ordered node pair. $V(G)$ and $E(G)$ denote its vertex set and edge
set. A branching (rooted at $a$) is a tree which is directed in such a
way that each receiver $t_i$ has one edge coming in. A cut of $G$
determined by a set $S \subset V(G)$ is the set of edges going from $S$
to $V(G) - S$ and will be denoted by $\Delta_G(S)$; we also set
$\delta_G(S) = |\Delta_G(S)|$.

Theorem: In the two-layer graph $\mathcal{D}_m$, if
$\delta_G(S) \ge k$ for every $S \subset V(G)$, $a \in S$,
$\exists t_i \in V(G) - S$, then there are $k$ edge-disjoint
branchings rooted at $a$.

Lovász's constructive proof: We use induction on $k$. It is obvious
that the theorem holds when $k = 0$.

Let $F$ be a set of edges satisfying the following conditions:

(i) $F$ is an arborescence rooted at $a$.

(Definition: In graph theory, an arborescence is a directed graph
in which, for a vertex $u$ called the root and any other vertex $v$,
there is exactly one directed path from $u$ to $v$. Equivalently, an
arborescence is a directed, rooted tree in which all edges point away
from the root. Every arborescence is a directed acyclic graph (DAG),
but not every DAG is an arborescence.)

(ii) $\delta_{G-F}(S) \ge k - 1$ for every $S \subset V(G)$, $a \in S$,
$\exists t_i \in V(G) - S$.

If $F$ covers all receivers $t_i$, i.e., it is a branching, then we are
finished: $G - F$ contains $k - 1$ edge-disjoint branchings and $F$ is
the $k$th one.

Suppose $F$ only covers a set $T \subset V(G)$ that does not contain
all receivers, i.e., there exist some receivers $t_i \notin T$. We show
that we can add an edge $e \in \Delta_G(T)$ to $F$ so that the arising
arborescence $F + e$ still satisfies (i) and (ii). Note that if
$r_i \in T$, then $t_i \in T$, because there are infinitely many
unit-capacity edges from $r_i$ to $t_i$, so adding an edge from $r_i$
to $t_i$ to $F$ still satisfies (i) and (ii).

Consider a maximal set $A \subset V(G)$ such that

(a) $a \in A$;

(b) There is at least one receiver $t_i \notin A \cup T$;

(c) $\delta_{G-F}(A) = k - 1$.

If no such $A$ exists, any edge

$$e \in \{(r_i, t_j) \mid r_i \in T,\ t_j \in V(G) - T\} \cup \{(h_i, t_j) \mid h_i \in T,\ t_j \in V(G) - T\} \cup \{(a, r_j) \mid t_j \notin T\} \cup \{(a, h_i) \mid h_i \notin T\}$$

can be added to $F$.
Otherwise, 
Since 

S G -f(A UT) = 6 a (A \JT)>k, 
we have AuT ± A,T <£A. Also, 
Figure 10: Performance of four-party conferences over the Internet, running prototype systems of Celerity, Simulcast, and the scheme in [4].
(a)-(d): Throughput of individual sessions. (e): Total throughput of all sessions. (f): Utility achieved by different systems. (g)-(h): End-to-end
delay and loss rate of sessions A, C, and D for the Celerity system.



$$\delta_{G-F}(A \cup T) > \delta_{G-F}(A)$$

and so there must be an edge $e = (x, y)$ which belongs to
$\Delta_{G-F}(A \cup T) - \Delta_{G-F}(A)$. Hence $x \in T - A$ and $y \in V(G) - T - A$. We
claim $e$ can be added to $F$, i.e., $F + e$ satisfies (i) and (ii).

Note that, due to the special structure of $D_m$,

$$e = (x, y) \in \{(r_i, t_j) \mid r_i \in T - A,\ t_j \in V(G) - T - A\} \cup \{(h_i, t_j) \mid h_i \in T - A,\ t_j \in V(G) - T - A\}.$$

So $y$ must be a receiver.

It is obvious that $F + e$ still satisfies (i).

Let $S \subset V(G)$, $a \in S$, $\exists t_i \in V(G) - S$. If $e \notin \Delta_{G-F}(S)$ then

$$\delta_{G-F-e}(S) = \delta_{G-F}(S) \ge k - 1.$$

If $e \in \Delta_{G-F}(S)$ then $x \in S$, $y \in V(G) - S$. We use the inequality

$$\delta_{G-F}(S \cup A) + \delta_{G-F}(S \cap A) \le \delta_{G-F}(S) + \delta_{G-F}(A), \qquad (13)$$

which follows by an easy counting.

Since $a \in S \cap A$, and there exists a receiver $y \in V(G) - S \cap A$, we
have

$$\delta_{G-F}(A) = k - 1, \qquad \delta_{G-F}(S \cap A) \ge k - 1,$$

and by the maximality of $A$,

$$\delta_{G-F}(S \cup A) \ge k,$$

since $S \cup A \ne A$ as $x \in S - A$, and there is at least one receiver
$y \notin (S \cup A) \cup T$ as $y \notin S \cup A$, $y \notin T$. Thus (13) implies

$$\delta_{G-F}(S) \ge k$$

and so

$$\delta_{G-F-e}(S) \ge k - 1.$$

Thus, we can increase $F$ until it finally satisfies (i) and (ii) and
reaches all receivers $t_i$. Then apply the induction hypothesis on $G - F$.
This completes the proof.

■ 
The above proof yields an efficient algorithm to construct a maximum
set of edge-disjoint trees reaching all receivers. Let

$$\kappa(G) = \min_{S \subset V(G),\ a \in S,\ \exists t_i \in V(G) - S} \delta_G(S).$$

These trees can be constructed edge by edge. At any stage, we
can increase $F$ by checking, for at most $E(G)$ edges $e$, whether or not

$$\kappa(G - F - e) \ge k - 1.$$

Determining $\kappa(G)$ can be done in $p$ steps, where $p$ is a
polynomial in $V(G)$ and $E(G)$. Thus, we can obtain $k$ edge-disjoint
trees in at most $O(pE(G))$ steps.

Over the two layer graph $D_m$, the algorithm packs unit-capacity
trees one by one. Each unit-capacity tree is constructed greedily,
edge by edge, starting from the source and augmenting towards all
receivers. It is similar to the greedy tree-packing algorithm based on
Prim's algorithm. The distinction lies in the rule for selecting the
edge among all potential edges: the greedy algorithm chooses the edge
whose removal leads to the least reduction in the multicast capacity of
the residual graph.

Because we always choose the edge whose removal leads to the least
reduction in the multicast capacity of the residual graph, the edge
we choose always satisfies $\kappa(G - F - e) \ge k - 1$. Therefore, based
on the above proof, we finally obtain $k$ edge-disjoint trees.

Due to the special structure of $D_m$, the time complexity of
computing $\kappa(G)$ is $O(V(G) \cdot E(G))$. Therefore, the time complexity of
the algorithm is $O(V(G) \cdot E^2(G))$.

■ 
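The edge-by-edge construction described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in Celerity: it runs on a small generic directed multigraph in which every non-root node is treated as a receiver (rather than on the two layer graph $D_m$), it computes $\kappa$ by running a plain augmenting-path max-flow to each receiver, and it accepts the first admissible edge instead of ranking candidates by capacity reduction. The names `max_flow`, `kappa` and `pack_trees` are hypothetical.

```python
from collections import deque

def max_flow(edges, n, s, t):
    """Max flow from s to t when every listed edge has unit capacity."""
    cap = [[0] * n for _ in range(n)]
    for u, v in edges:
        cap[u][v] += 1                      # parallel edges add up
    flow = 0
    while True:
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:    # BFS for an augmenting path
            u = queue.popleft()
            for v in range(n):
                if cap[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            return flow
        v = t
        while v != s:                       # push one unit along the path
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1

def kappa(edges, n, root, receivers):
    """Smallest cut separating the root from some receiver (via max-flow)."""
    return min(max_flow(edges, n, root, t) for t in receivers)

def pack_trees(edges, n, root):
    """Greedily build kappa(G) edge-disjoint spanning trees rooted at root."""
    receivers = [v for v in range(n) if v != root]
    remaining = list(edges)                 # plays the role of G - F
    k = kappa(remaining, n, root, receivers)
    trees = []
    while k > 0:
        tree, covered = [], {root}
        while len(covered) < n:
            for idx, (u, v) in enumerate(remaining):
                if u in covered and v not in covered:
                    rest = remaining[:idx] + remaining[idx + 1:]
                    # admissible iff kappa(G - F - e) >= k - 1
                    if kappa(rest, n, root, receivers) >= k - 1:
                        tree.append((u, v))
                        covered.add(v)
                        remaining = rest
                        break
            else:
                raise RuntimeError("no admissible edge found")
        trees.append(tree)
        k -= 1
    return trees

example_edges = [(0, 1), (0, 2), (1, 2), (2, 1)]
example_trees = pack_trees(example_edges, 3, 0)
```

In this toy graph the edge $(0, 2)$ is rejected while building the first tree, because removing it together with $(0, 1)$ would disconnect the root from the residual graph; that is the admissibility test $\kappa(G - F - e) \ge k - 1$ at work.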

Proof of the inequality (13).

Proof: Suppose $e = (x, y) \in \Delta_{G-F}(S \cup A)$; then $x \in S \cup A$ and
$y \in V(G) - S - A$, thus we must have

$$e \in \Delta_{G-F}(S) \cup \Delta_{G-F}(A).$$

Similarly, suppose $e = (x, y) \in \Delta_{G-F}(S \cap A)$; then $x \in S \cap A$ and
$y \in V(G) - S \cap A$, and we also have

$$e \in \Delta_{G-F}(S) \cup \Delta_{G-F}(A).$$

If $e = (x, y) \in \Delta_{G-F}(S \cup A) \cap \Delta_{G-F}(S \cap A)$, then $x \in S \cap A$ and
$y \in V(G) - S - A$. Therefore we have

$$e \in \Delta_{G-F}(S) \cap \Delta_{G-F}(A).$$

Based on the above observations, we have

$$\delta_{G-F}(S \cup A) + \delta_{G-F}(S \cap A) \le \delta_{G-F}(S) + \delta_{G-F}(A).$$
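As a sanity check on the counting argument, inequality (13) can be verified exhaustively on a small directed graph. The sketch below is not part of the original proof; the helper names `cut_size` and `satisfies_inequality_13` and the random test graph are assumptions made purely for illustration.

```python
import itertools
import random

def cut_size(edges, subset):
    """Number of edges going from `subset` to its complement."""
    return sum(1 for u, v in edges if u in subset and v not in subset)

def satisfies_inequality_13(edges, n):
    """Check delta(S u A) + delta(S n A) <= delta(S) + delta(A) for all S, A."""
    subsets = [frozenset(c) for r in range(n + 1)
               for c in itertools.combinations(range(n), r)]
    return all(
        cut_size(edges, s | a) + cut_size(edges, s & a)
        <= cut_size(edges, s) + cut_size(edges, a)
        for s in subsets for a in subsets)

rng = random.Random(0)
n = 5
edges = [(rng.randrange(n), rng.randrange(n)) for _ in range(12)]
edges = [(u, v) for u, v in edges if u != v]   # drop self-loops
holds = satisfies_inequality_13(edges, n)
```

The check enumerates all $2^n \times 2^n$ pairs of node sets, so it is only practical for very small $n$.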



B. Proof of Corollary 2 

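Proof: Let a length-$|E|$ binary vector $I_X$ be the indicator vector for
edge set $X \subset E$; its $e$-th entry is 1 if $e \in X$, and 0 otherwise.

Since $R_m(c_m, D)$ is the minimum min-cut over $D_m$, it
can be expressed as

$$R_m(c_m, D) = \min_{t_i \in T}\ \min_{U:\, s \in U,\ t_i \in \bar{U}} I_{\delta(U)}^T c_m,$$

where $\delta(U)$ denotes the set of edges going from $U$ to $\bar{U}$. So $R_m(c_m, D)$
is the pointwise minimum of a family of linear functions. Let $c_m^1$
and $c_m^2$ denote two different link rate vectors, and let $\lambda_1 + \lambda_2 = 1$,
$\lambda_1 \ge 0$, $\lambda_2 \ge 0$.

Then we have

$$
\begin{aligned}
R_m(\lambda_1 c_m^1 + \lambda_2 c_m^2, D)
&= \min_{t_i \in T}\ \min_{U:\, s \in U,\ t_i \in \bar{U}} I_{\delta(U)}^T (\lambda_1 c_m^1 + \lambda_2 c_m^2) \\
&\ge \min_{t_i \in T}\ \min_{U:\, s \in U,\ t_i \in \bar{U}} I_{\delta(U)}^T (\lambda_1 c_m^1)
 + \min_{t_i \in T}\ \min_{U:\, s \in U,\ t_i \in \bar{U}} I_{\delta(U)}^T (\lambda_2 c_m^2) \\
&= R_m(\lambda_1 c_m^1, D) + R_m(\lambda_2 c_m^2, D) \\
&= \lambda_1 R_m(c_m^1, D) + \lambda_2 R_m(c_m^2, D).
\end{aligned}
$$

So $R_m(c_m, D)$ is a concave function of the overlay link rates $c_m$.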

■ 
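The concavity argument can be checked numerically on a toy overlay. The following sketch is an illustration only: it evaluates the min-cut by brute force over all node sets containing the source, on a hypothetical complete directed graph with four nodes, and measures the gap $R(\lambda c^1 + (1-\lambda) c^2) - \lambda R(c^1) - (1-\lambda) R(c^2)$, which should never be negative. The function names and the random rates are assumptions, not part of the paper.

```python
import itertools
import random

def min_cut_rate(cap, n, source, receivers):
    """Smallest capacity leaving any node set U that contains the source
    and misses at least one receiver (brute force over all such U)."""
    others = [v for v in range(n) if v != source]
    best = float("inf")
    for r in range(len(others)):             # U is always a proper subset
        for extra in itertools.combinations(others, r):
            U = {source, *extra}
            if all(t in U for t in receivers):
                continue                     # no receiver is cut off by U
            value = sum(rate for (u, v), rate in cap.items()
                        if u in U and v not in U)
            best = min(best, value)
    return best

def concavity_gap(c1, c2, lam, n, source, receivers):
    """R(lam*c1 + (1-lam)*c2) - [lam*R(c1) + (1-lam)*R(c2)]."""
    mix = {e: lam * c1[e] + (1 - lam) * c2[e] for e in c1}
    lhs = min_cut_rate(mix, n, source, receivers)
    rhs = (lam * min_cut_rate(c1, n, source, receivers)
           + (1 - lam) * min_cut_rate(c2, n, source, receivers))
    return lhs - rhs

def random_rates(rng, n):
    """Hypothetical link rates on a complete directed graph."""
    return {(u, v): rng.uniform(0.0, 2.0)
            for u in range(n) for v in range(n) if u != v}
```

A typical use is to draw two rate vectors with `random_rates`, pick $\lambda \in [0, 1]$, and confirm that `concavity_gap` is non-negative up to floating-point rounding.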

C. Proof of Proposition 1 

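Proof: For any $e \in E$ and $m = 1, \ldots, M$, let $c_{m,e}^{(1)}$ and $c_{m,e}^{(2)}$ denote two
different values. It is easy to verify that
$-\sum_{l \in \mathcal{L}} \int_0^{a_l^T y} \frac{(z - C_l)^+}{z}\,dz$ is
a concave function and $-\sum_{l \in \mathcal{L}} a_{l,e} \frac{(a_l^T y - C_l)^+}{a_l^T y}$ is its subgradient with
respect to $c_{m,e}$. Therefore, we just need to show that $U'_m(R_m) \frac{\partial R_m}{\partial c_{m,e}}$ is a
subgradient of $U_m(R_m)$ with respect to $c_{m,e}$.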

Since $U_m(R_m)$ is an increasing and strictly concave function and
$R_m(c_m, D)$ is a concave function with respect to $c_m$, which has been
proved in Corollary 1, we have

$$U_m(R_m^{(1)}) - U_m(R_m^{(2)}) \le U'_m(R_m^{(2)}) (R_m^{(1)} - R_m^{(2)})$$

and

$$R_m^{(1)} - R_m^{(2)} \le \frac{\partial R_m^{(2)}}{\partial c_{m,e}} (c_{m,e}^{(1)} - c_{m,e}^{(2)}).$$

Since $U_m(R_m)$ is nondecreasing, we have $U'_m(R_m^{(2)}) \ge 0$. Then

$$U_m(R_m^{(1)}) - U_m(R_m^{(2)}) \le U'_m(R_m^{(2)}) (R_m^{(1)} - R_m^{(2)}) \le U'_m(R_m^{(2)}) \frac{\partial R_m^{(2)}}{\partial c_{m,e}} (c_{m,e}^{(1)} - c_{m,e}^{(2)}).$$

Therefore, $U'_m(R_m) \frac{\partial R_m}{\partial c_{m,e}}$ is a subgradient of $U_m(R_m)$ with respect
to $c_{m,e}$.

■ 
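The chain of inequalities above can be illustrated numerically. In the sketch below, which is an assumption-laden illustration rather than the paper's code, the subgradient of the min-cut rate with respect to the link rates is taken to be the indicator vector of one minimising cut, and the utility is assumed to be $U(R) = \log R$. The function `subgradient_gap` returns the right-hand side minus the left-hand side of the subgradient inequality, which should be non-negative. All names are hypothetical.

```python
import itertools
import math
import random

def critical_cut(cap, n, source, receivers):
    """Return (R, cut_edges): the min-cut value and the edges of one
    minimising cut, found by brute force over node sets containing source."""
    others = [v for v in range(n) if v != source]
    best, best_edges = float("inf"), []
    for r in range(len(others)):
        for extra in itertools.combinations(others, r):
            U = {source, *extra}
            if all(t in U for t in receivers):
                continue
            cut = [e for e in cap if e[0] in U and e[1] not in U]
            value = sum(cap[e] for e in cut)
            if value < best:
                best, best_edges = value, cut
    return best, best_edges

def subgradient_gap(cap, cap_other, n, source, receivers):
    """U'(R) * I_cut^T (c' - c) - [U(R(c')) - U(R(c))] for U = log."""
    R, cut = critical_cut(cap, n, source, receivers)
    R_other, _ = critical_cut(cap_other, n, source, receivers)
    slope = 1.0 / R                          # U'(R) for U = log
    linear = sum(slope * (cap_other[e] - cap[e]) for e in cut)
    return linear - (math.log(R_other) - math.log(R))

def random_rates(rng, n):
    """Hypothetical strictly positive rates on a complete directed graph."""
    return {(u, v): rng.uniform(0.5, 2.0)
            for u in range(n) for v in range(n) if u != v}
```

Only the links on the chosen minimising cut receive a non-zero subgradient entry, which is why a per-link indicator of critical links is sufficient to drive the rate update.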

D. Proof of Theorem 3 

Proof: Let $g = \max_{l \in \mathcal{L}} \frac{1}{C_l}$ and $A = \mathrm{diag}(C_l, l \in \mathcal{L})$. Let $(c^*, p^*)$ be a
saddle point of the Lagrangian function $\mathcal{G}(c, p)$. We use $\mathcal{G}_c(c, p)$ and
$\mathcal{G}_p(c, p)$ to denote a subgradient of $\mathcal{G}(c, p)$ with respect to $c$ and a
subgradient of $\mathcal{G}(c, p)$ with respect to $p$, respectively. Suppose that
$|U'_m(R_m(c_m))|$, $\forall m \in M$, is upper bounded by a positive constant $\bar{U}$.

Under this assumption, there is a constant $\Lambda > 0$ such that
$\|\mathcal{G}_c(c^{(k)}, p^{(k)})\|_2 \le \Lambda$ and $\|\mathcal{G}_p(c^{(k)}, p^{(k)})\|_2 \le \Lambda$ for all $k \ge 0$.

In order to prove the theorem, we need the following two
lemmas.

Lemma 1: (a) For any $c \ge 0$ and all $k \ge 0$,

$$\|c^{(k+1)} - c\|_2^2 \le \|c^{(k)} - c\|_2^2 + 2\alpha\left[\mathcal{G}(c^{(k)}, p^{(k)}) - \mathcal{G}(c, p^{(k)})\right] + \alpha^2 \|\mathcal{G}_c(c^{(k)}, p^{(k)})\|_2^2.$$

(b) For any $p \ge 0$ and all $k \ge 0$,

$$(p^{(k+1)} - p)^T A (p^{(k+1)} - p) \le (p^{(k)} - p)^T A (p^{(k)} - p) - 2\left[\mathcal{G}(c^{(k)}, p^{(k)}) - \mathcal{G}(c^{(k)}, p)\right] + g \|\mathcal{G}_p(c^{(k)}, p^{(k)})\|_2^2.$$

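Proof: (a) From the algorithm (9)-(10), we obtain that for any
$c \ge 0$ and all $k \ge 0$,

$$
\begin{aligned}
\|c^{(k+1)} - c\|_2^2 &\le \|c^{(k)} + \alpha \mathcal{G}_c(c^{(k)}, p^{(k)}) - c\|_2^2 \\
&= \|c^{(k)} - c\|_2^2 + 2\alpha\, \mathcal{G}_c(c^{(k)}, p^{(k)})^T (c^{(k)} - c) + \alpha^2 \|\mathcal{G}_c(c^{(k)}, p^{(k)})\|_2^2.
\end{aligned}
$$

Since the function $\mathcal{G}(c, p)$ is concave in $c$ for each $p \ge 0$, and
since $\mathcal{G}_c(c^{(k)}, p^{(k)})$ is a subgradient of $\mathcal{G}(c, p^{(k)})$ with respect to $c$ at
$c = c^{(k)}$, we obtain for any $c$,

$$\mathcal{G}_c(c^{(k)}, p^{(k)})^T (c^{(k)} - c) \le \mathcal{G}(c^{(k)}, p^{(k)}) - \mathcal{G}(c, p^{(k)}).$$

Hence, for any $c \ge 0$ and all $k \ge 0$,

$$\|c^{(k+1)} - c\|_2^2 \le \|c^{(k)} - c\|_2^2 + 2\alpha\left[\mathcal{G}(c^{(k)}, p^{(k)}) - \mathcal{G}(c, p^{(k)})\right] + \alpha^2 \|\mathcal{G}_c(c^{(k)}, p^{(k)})\|_2^2.$$

(b) Similarly, from (9)-(10), for any $p_l \ge 0$, $l \in \mathcal{L}$, we have

$$C_l |p_l^{(k+1)} - p_l|^2 \le C_l |p_l^{(k)} - p_l|^2 - 2 (p_l^{(k)} - p_l)\, \mathcal{G}_{p_l}(c^{(k)}, p^{(k)}) + \frac{1}{C_l} |\mathcal{G}_{p_l}(c^{(k)}, p^{(k)})|^2.$$

By adding these relations over all $l \in \mathcal{L}$, we obtain for any $p \ge 0$
and all $k \ge 0$,

$$(p^{(k+1)} - p)^T A (p^{(k+1)} - p) \le (p^{(k)} - p)^T A (p^{(k)} - p) - 2 (p^{(k)} - p)^T \mathcal{G}_p(c^{(k)}, p^{(k)}) + g \|\mathcal{G}_p(c^{(k)}, p^{(k)})\|_2^2.$$

Since $\mathcal{G}_p(c^{(k)}, p^{(k)})$ is a subgradient of the linear function $\mathcal{G}(c^{(k)}, p)$
at $p = p^{(k)}$, we have for all $p$,

$$(p^{(k)} - p)^T \mathcal{G}_p(c^{(k)}, p^{(k)}) = \mathcal{G}(c^{(k)}, p^{(k)}) - \mathcal{G}(c^{(k)}, p).$$

Therefore, for any $p \ge 0$ and all $k \ge 0$,

$$(p^{(k+1)} - p)^T A (p^{(k+1)} - p) \le (p^{(k)} - p)^T A (p^{(k)} - p) - 2\left[\mathcal{G}(c^{(k)}, p^{(k)}) - \mathcal{G}(c^{(k)}, p)\right] + g \|\mathcal{G}_p(c^{(k)}, p^{(k)})\|_2^2.$$

Lemma 2: Let $\hat{c}(k)$ and $\hat{p}(k)$ be the iterate averages given by

$$\hat{c}(k) = \frac{1}{k} \sum_{i=0}^{k-1} c^{(i)}, \qquad \hat{p}(k) = \frac{1}{k} \sum_{i=0}^{k-1} p^{(i)};$$

we then have for all $k \ge 1$,

$$-\frac{1}{2k\alpha} \|c^{(0)} - c\|_2^2 - \frac{\alpha}{2} \Lambda^2 \le \frac{1}{k} \sum_{i=0}^{k-1} \mathcal{G}(c^{(i)}, p^{(i)}) - \mathcal{G}(c, \hat{p}(k)), \qquad (14)$$

$$\frac{1}{k} \sum_{i=0}^{k-1} \mathcal{G}(c^{(i)}, p^{(i)}) - \mathcal{G}(\hat{c}(k), p) \le \frac{(p^{(0)} - p)^T A (p^{(0)} - p)}{2k} + \frac{g \Lambda^2}{2}. \qquad (15)$$

Proof: By using Corollary 1 and Lemma 1(a), we have for any
$c \ge 0$ and $i \ge 0$,

$$\frac{1}{2\alpha} \left[\|c^{(i+1)} - c\|_2^2 - \|c^{(i)} - c\|_2^2\right] - \frac{\alpha}{2} \Lambda^2 \le \mathcal{G}(c^{(i)}, p^{(i)}) - \mathcal{G}(c, p^{(i)}).$$

By adding these relations over $i = 0, \ldots, k - 1$, we obtain for any
$c \ge 0$ and $k \ge 1$,

$$\frac{1}{2k\alpha} \left[\|c^{(k)} - c\|_2^2 - \|c^{(0)} - c\|_2^2\right] - \frac{\alpha}{2} \Lambda^2 \le \frac{1}{k} \sum_{i=0}^{k-1} \mathcal{G}(c^{(i)}, p^{(i)}) - \frac{1}{k} \sum_{i=0}^{k-1} \mathcal{G}(c, p^{(i)}).$$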
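Since the function $\mathcal{G}(c, p)$ is linear in $p$ for any fixed $c \ge 0$, there
holds

$$\mathcal{G}(c, \hat{p}(k)) = \frac{1}{k} \sum_{i=0}^{k-1} \mathcal{G}(c, p^{(i)}).$$

Combining the preceding two relations, we obtain for any $c \ge 0$
and $k \ge 1$,

$$-\frac{1}{2k\alpha} \|c^{(0)} - c\|_2^2 - \frac{\alpha}{2} \Lambda^2 \le \frac{1}{k} \sum_{i=0}^{k-1} \mathcal{G}(c^{(i)}, p^{(i)}) - \mathcal{G}(c, \hat{p}(k)),$$

thus establishing relation (14).

Similarly, by using Corollary 1 and Lemma 1(b), we have for any
$p \ge 0$ and $i \ge 0$,

$$(p^{(i+1)} - p)^T A (p^{(i+1)} - p) \le (p^{(i)} - p)^T A (p^{(i)} - p) - 2\left[\mathcal{G}(c^{(i)}, p^{(i)}) - \mathcal{G}(c^{(i)}, p)\right] + g \Lambda^2.$$

By adding these relations over $i = 0, \ldots, k - 1$, we obtain for any
$p \ge 0$ and $k \ge 1$,

$$
\begin{aligned}
\frac{1}{k} \sum_{i=0}^{k-1} \left[\mathcal{G}(c^{(i)}, p^{(i)}) - \mathcal{G}(c^{(i)}, p)\right] - \frac{g \Lambda^2}{2}
&\le \frac{(p^{(0)} - p)^T A (p^{(0)} - p)}{2k} - \frac{(p^{(k)} - p)^T A (p^{(k)} - p)}{2k} \\
&\le \frac{(p^{(0)} - p)^T A (p^{(0)} - p)}{2k}.
\end{aligned}
$$

Because the function $\mathcal{G}(c, p)$ is concave in $c$ for any fixed $p \ge 0$,
we have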

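$$\frac{1}{k} \sum_{i=0}^{k-1} \mathcal{G}(c^{(i)}, p) \le \mathcal{G}(\hat{c}(k), p).$$

Combining the preceding two relations, we obtain for any $p \ge 0$
and $k \ge 1$,

$$\frac{1}{k} \sum_{i=0}^{k-1} \mathcal{G}(c^{(i)}, p^{(i)}) - \mathcal{G}(\hat{c}(k), p) \le \frac{(p^{(0)} - p)^T A (p^{(0)} - p)}{2k} + \frac{g \Lambda^2}{2}.$$

Our proof of this theorem is based on Lemma 2. In particular, by
letting $c = c^*$ and $p = p^*$ in equations (14) and (15), respectively,
we obtain

$$-\frac{1}{2\alpha k} \|c^{(0)} - c^*\|_2^2 - \frac{\alpha \Lambda^2}{2} \le \frac{1}{k} \sum_{i=0}^{k-1} \mathcal{G}(c^{(i)}, p^{(i)}) - \mathcal{G}(c^*, \hat{p}(k)),$$

$$\frac{1}{k} \sum_{i=0}^{k-1} \mathcal{G}(c^{(i)}, p^{(i)}) - \mathcal{G}(\hat{c}(k), p^*) \le \frac{(p^{(0)} - p^*)^T A (p^{(0)} - p^*)}{2k} + \frac{g \Lambda^2}{2}.$$

By the saddle-point relation, we have

$$\mathcal{G}(\hat{c}(k), p^*) \le \mathcal{G}(c^*, p^*) \le \mathcal{G}(c^*, \hat{p}(k)).$$

Combining the preceding three relations, we obtain for all $k \ge 1$,

$$-\frac{1}{2\alpha k} \|c^{(0)} - c^*\|_2^2 - \frac{\alpha \Lambda^2}{2} \le \frac{1}{k} \sum_{i=0}^{k-1} \mathcal{G}(c^{(i)}, p^{(i)}) - \mathcal{G}(c^*, p^*) \le \frac{(p^{(0)} - p^*)^T A (p^{(0)} - p^*)}{2k} + \frac{g \Lambda^2}{2}.$$

Algorithm 1 Link Rate Control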
/*
Every 200 ms, each peer measures the loss rate and queuing delay of
its incoming links, gets the source sending rate from the packets
of the corresponding session source it has received, adjusts the rates
of these links based on the link rate control algorithm, and then
sends them to their corresponding upstream senders for the new
rates to take effect.
S denotes the set of all sessions. s_m denotes the session of peer m.
E_m denotes the set of incoming links of peer m. I_{m,e} is the critical
link indicator of link e for session s_m. If e is a critical link, then
I_{m,e} = 1; otherwise, I_{m,e} = 0.
*/
1: for all e ∈ E_m do
       /* get the loss rate of the link e */
2:     lossrate ← GetAverageLoss();
       /* get the queuing delay of the link e */
3:     queuing-delay ← GetAverageQueuingDelay();
4:     for all s ∈ S do
5:         if s ≠ s_m then
               /* get the source sending rate of session s */
6:             sending-rate ← GetSourceSendingRate();
               /* get the critical cut indicator of link e for session s */
7:             I_{m,e} ← GetCriticalCut(e, m);
8:             delta ← step-size (I_{m,e} sending-rate − lossrate − queuing-delay);
9:             list.push_back(pair<s, delta>);
10:        end if
11:    end for
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